hbase-2437 log split code revamp by Cosmin

Review Request #74 - Created May 21, 2010 and submitted

This is a version of Cosmin's patch that applies to trunk doctored to work w/ hadoop 0.20.
All his new tests past.
  1. I'm for committing this (I'm sick of looking at it)
    1. Working on it. Some of the stuff might need more input
  1. Looks pretty good. If possible it would be awesome to move all of these log splitting methods into a new class, but we can do that in a followup.
    1. Let's have a separate issue for that
  2. why not make these Class<? extends HLog.Reader>?
    1. done - for both reader and writer
  3. can we use conf.getClass here? That way we'd have the appropriate generic types
    1. done - both reader and writer impl
  4. having looked at this function a few times and been very confused, I'd love to see a javadoc that explains the entire process - eg what the "batching" does, what the output looks like, etc
    1. the original version of this function determined me to start refactoring in the first place. I'll add the description but if it's still confusing it might need more work.
    2. Oh, it's much much improved! Thanks! I still think a "high level overview" would be good to see.
    3. Sure. I've already included the overview. Let's refine it if necessary. 
  5. this conf is incorrectly named, right? it's not a number of threads, but rather a number of logs to read in in each round?
    1. perhaps hbase.regionserver.hlog.splitlog.batch.size?
    2. why .regionserver.? I'd say just hbase.hlog.split.batch.size or something
    3. done
  6. this conf is very vaguely named
    1. I know and tried to get a better name when created it. Can you suggest something better? I can't figure a short descriptive enough name
    2. if this applies only to hlog splitting, maybe hbase.hlog.split.skip.errors
    3. I think this should be a global setting as it will probably involve behavior in other places as well. 
      Changed it to hbase.hlog.split.skip.errors for now
  7. should add e as a second parameter, so you get the full stack trace in this warning
  8. typo, and we should fix this before committing. we can use the org.apache.hadoop.util.VersionInfo class for this, or use reflection to look for the "hflush()" function
    1. I'd rather have these differences dealt at the lowest level (writers) and abstracted than spread across code.
      What do you think? 
    2. Which do you mean by "writers" here? I'd support factoring this function out into an HdfsUtil class somewhere.
    3. By "writers" I mean the classes that implement the Writer interface (sync, append, etc). There's a HLog.Writer interface and the implementation is SequenceFileLogWriter that encapsulates a SequenceFile.Writer 
      We should have a common interface for HBase classes that append, sync, etc. 
      If the filesystem doesn't support append - that should be taken care in the Writer implementation that actually talks with HDFS. i.e. do it in a single class and avoid duplicating the appendSupport logic through code.
      For now I've:
      - changed it to look for hflush in FSDataOutputStream.
      - renamed isAppend to isAppendSupported, 
      - renamed recoverLog to recoverFileLease 
      - moved both to hbase.util.FSUtils 
  9. we should use a namingthreadfactory here. Consider:
    (I have some patches coming in which use guava, it's very handy)
    1. done 
      I'll need help setting guava as a maven dependency.
  10. we're calling execute(Thread) whereas execute takes a Runnable. By some quirk of java, Thread extends Runnable, but this is somewhat strange - we should make createNewSplitThread be createNewSplitter or createNewSplitRunnable
    1. createNewSplitter is fine.
      It's a Callable, changed to submit in order to check for the result in case one of the writers failed (see comment below)
  11. if j > 30 (or something), escalate log to info level. Also preferable to log the number of seconds, not the iteration.
  12. yes, I think so. The RS could have crashed right after opening but before writing any data, and if the master failed to recover that, then we'd never recover that region. I say ignore with a WARN
    1. more aspects here:
      I think the reported size will be >0 after recover, even if file has no records. I was asking if we should add logic to check if it's the last log. 
      EOF for non zero length, non zero records file means file is corrupted. 
    2. I agree if it has no records (I think - do we syncfs after writing the sequencefile header?). But there's the case where inside SequenceFile we call create, but never actually write any bytes. This is still worth recovering.
      In general I think a corrupt tail means we should drop that record (incomplete written record) but not shut down. This is only true if it's the tail record, though.
    3.  - How can we determine it's the tail record or the 5th out of 10 records that's broken? We just get an EOF when calling next()
       - Currently we ignore empty files. Is it ok to ignore an empty log file if it's not the last one?
       - I'm not sure whether it's possible to get an EOF when acquiring the reader for a file after it has been recoverFileLease()-ed. So the whole try/catch for HLog.getReader might be redundant.
      When reading log entries we currently don't catch. We read as much as we can and then let any exception bubble up. splitLog logic decides what to do next: If we get to a broken record it will most probably throw an EOF in there and based on skip.errors setting it will act accordingly. There will be no EOF if there are no records, though and we continue.
      There are two possible reasons for a file being corrupted/empty:
      1 HRegion died => only the last log entry (edit) in the last log in the directory should be affected => we could continue but are we sure it's the tail record?
      2 Another component screwed things up (bug) => other logs than the last one could be affected => we should halt in this situation.
  13. see above logic - writer could have crashed after writing only part of the sequencefile header, etc, so we should just warn and continue
  14. I think we need to handle EOF specially here too, though OK to leave this for another JIRA. IIRC one of the FB guys opened this already
    1. what's the other JIRA? see my above comments.
    2. Can't find it now... does my above comment make sense?
  15. this thread name is not very descriptive
  16. wouldn't it make more sense to just add them in the other order? LinkedList is doubly linked, so adding to the beginning or end are both constant time, and then we wouldn't have to do this contortion to iterate backwards here. 
  17. we can move this out of the inner loop, right?
  18. add comment about what this is doing - presumably the point here is that some previous master may have started splitting the logs, then crashed in the middle.
  19. no point to call .sync() here, it just wastes a bunch of IO to write "sync markers" which we don't make any real use of.
    1. sync() used to call syncFs(). It looks like HBASE-2544 changed things a bit, but it doesn't only add the SequenceFile sync marker.
      I added this after I've seen inconsistent results when running splitLog on bigger hlogs. Try copying a log from the cluster locally and run splitLog from the command line a few times without flushing it after each append. I used to get inconsistent results between runs and calling sync fixed it.
      There's this "//TODO: test the split of a large (lots of regions > 500 file). In my tests it seems without hflush"  in the TestHLogSplit. 
      We could do some testing to figure out why would log entries be lost when running locally.
      What would be a better way to flush the writer?
    2. This seems really voodoo.. if anything we're probably masking a real bug by doing this. Can you write a unit test which shows this problem (even if it takes 30 minutes to run, would be good to have in our arsenal)
    3. I can't reproduce it on hdfs-0.20. I can't compile hdfs-0.21 (again) for some reason. I'll give it another try some other time. 
      Added the test. Also tried with a real 60MB log file. 
      I'm not sure if we should leave the test active.
  20. I'd think we need to fail here. This ties into the comment above about getting Futures out of the threadpool and checking any uncaught exceptions
    1. my previous comment got lost somehow.
      Todd suggested submitting a Callable<Void> to executor thread.
      I wonder if we could use getCompletedTaskCount. Documentation sais it's an estimation, however it's an estimation only during runtime and seems to be correct after shutdown finishes (I looked in the source as well)
      Another option would be ExecutorCompletionService which seems to be suited for this kind of job.
    2. correction: Todd suggested submitting a Callable<Void> to executor thread and then do a Future.get() and catch. 
    3. getCompletedTaskCount won't work
      I used Todd's approach and caught ExecutionException to determine if a write has failed.
      Injected another Writer implementation (one that can be triggered to throw while writing).
  21. the .exists checks are redundant, since HDFS mkdirs acts like mkdir -p, we're just wasting RPCs
  22. this log message has never been english
    1. I don't know what it's supposed to mean either :)
    2. changed it
  23. this is the case I take issue with - trailing garbage should still proceed (with warnings) even if we've told it not to skip errors.
    1. I'd like to be able to investigate the trailing garbage. I don't think this should ever happen (do you see any scenarios?). If it did we might lose data. We used to fix NameNode edits for fsImage by adding a missing byte to a corrupted entry.
      I'd like to reflect more on this, maybe see other opinions. 
    2. The case where this happens is if you crash in the middle of appending a long edit. Consider the case where a single edit might have 1MB of data (large rows). We can easily crash in the middle of transferring it, before we call sync on the edit. In this case, the client never received an "ack" for the write, so we can feel free to throw it away (this isn't data loss, it's correct operation).