FLUME 29 - Compression using gzip before writing file to HDFS

Review Request #377 - Created July 23, 2010 and discarded

HenryR, jon, phunt
FLUME-29 adding compression interface before writing files to HDFS
  1. Cool.  I'm pretty convinced this works but we need a few things before we can can commit it.  
    1) We need test cases.  Look at TestCustomEscapedOutputDfs or TestDFSWrite for reasonable templates.  A good test would show that data is being compressed.  A better test would check to make sure the contents once uncompressed is the same.
    2) There are spacing issues -- you can see the red marks.  make sure to set your spacing to 2 and only use spaces.  a ctrl-shift-f in eclipse (if setup properly) may do all this work for you.
    3) The rest of the comments are mostly just idioms or things to be consistent with the rest of the code base.  Questions usually are asking for justification or a change, statements are usually things that need to be changed.
    1. 1 - added test case that writes gzip file, reads it back and assertEquals for data.
      2 - fixed, using 2 spaces and replaced tab with space
      3 - see below for relevant comments
  2. conf/flume-conf.xml (Diff revision 1)
    to be consistent, use true/false instead of yes/no?
  3. use getBoolean and use true/false instead of yes/no
  4. I think we can get away with only having one variable here -- can  FSDataOutputStream be changed to be OutputStream?  
    If possible, this means we could avoid having a separate gzOStm field.
    1. i had tried that approach before, the gzip compression ended up writing 0 byte sized files.... will revisit this at later stage
  5. unused?  please remove.
  6. move into open call?
  7. move into open?
  8. can this be shared outside the compression's if?
  9. Not sure writer.close is required if its encapsulating compressor version does close.  Is this necessary?  
    Does it produce warning/exceptions?
    1. its needed if compression is set to false. 
Review request changed
  2. Putting one's own copyright notice as done in this latest patch is against our policy (sort of like how we don't allow @author tags in the javadoc). One problem is that then everyone who patches the file is tempted to add a copyright for their contributions, this gets ugly fast. We don't try to fully document the reality of the copyright ownership in the source however the commit log / scm will document it, though. That's one of the reasons why we ask contributors to use git's "format-patch", as it maintains this information. Please remove this.
  1. Just a few nits, and then I think this looks good!
  2. lowercase boolean would do.
  3. phunt will comment on this.
  4. ok, you can stay
  5. inside of open.
  6. high level comment would be nice. 
    1. comments, now come on :-)
  7. Needs to delete tmp file/dir ( FileUtil.rmr(f); )