FLUME-195: Allow custom Flume OutputFormats via a plugin interface

Review Request #773 - Created Sept. 2, 2010 and submitted

Eric Sammer
HenryR, jon, phunt
Initial implementation of FLUME-195.

- This adds a (static) configuration parameter flume.outputformat.plugin.classes that allows users to specify a comma separated list of plugin classes to load.
- Plugin classes are loaded at FlumeNode start up time. (Changes requires a node restart due to the nastiness of class unloading.)
- OutputFormat abstract class now mandates a getName() implementation that returns the user string (i.e. raw, avrojson, ...).
- All existing output formats retrofitted for compatibility.
- Two methods for registering output builders added to FormatFactory.
- Updated TestFormatFactory to be JUnit4 annotation style.
- Added a test case for registering a custom format.

Outstanding / Missing:

- Tests for class loading. I have something but it's flaky.
- Plugin registration in FlumeNode should probably be pushed into a method rather than being inline where it is.
- One inconsistency in error logging during plugin registration: e.printStackTrace() is called and shouldn't be (IDE generated, I missed it).
- I'm not happy with the duplicated naming of OutputFormats and OutputFormatBuilders, but I did this to minimize the code impact.
- Test cases added.
- All tests run.
  1. Eric,
    There is a synchronization problem that needs to be resolved.
    The rest are nits.
    Do you have plans for more formats?
  2. I think I like flume.plugin.outputformat.classes more.  opinion?
  3. src/java/com/cloudera/flume/handlers/text/FormatFactory.java (Diff revision 2)
    nit: These feel like they should be XxxFormat.getName() as opposed to XxxFormat.builder().getName().
    We are instantiating builder multiple times!
    1. This is part of a larger issue. The problem is that we need a map of builders which create formats but formats are what are referenced by the names. I'm looking at this and I think it can be fixed but if it's OK with other reviewers, I'd like to take it as a separate issue.
  4. here the map is accessed unsynchronized
    1. I think I got all of these now. Thanks!
  5. but here it is synchronized.  
    Synchronization is inconsistent.  Either guard this all the time, or never guard it.
  6. And synchronized here too.
  7. and unsynchronized here again.
  8. I like the tests.
  9. src/javatest/com/cloudera/flume/agent/TestFlumeNode.java (Diff revision 2)
    maybe a separate test?
  10. Leave a comment saying that these plugin class will be present in subsequent tests using the defauly Factory
    Reset the factory to default. (may be with @After function)
    1. Added support for FormatFactory#unregisterFormat(String) and called from the test.
  1. I think there is still a race possiblity.  The other is just a nit.
  2. src/java/com/cloudera/flume/handlers/text/FormatFactory.java (Diff revisions 2 - 3)
    This should return a copy of the values instead of a reference, or warn very strongly about returning a reference.  
    I prefer #1 because this could cause a race -- since this is lock guarded, someone could get a reference and do thread unsafe stuff.  
    1. Fixed to use Collections.unmodifiableCollection(registeredFormats.values())
  3. src/java/com/cloudera/flume/agent/FlumeNode.java (Diff revision 3)
    Nit: These probably could be consolidated into one:
    } catch (Exception e) {
      LOG.warn("Unable to load output format plugin class " + className +  " - " + e.getMessage()); 
    1. I opted to leave this because I don't want to catch things like RuntimeExceptions and friends (even though it's ugly).
Review request changed
  1. Fix this and I think it is good.
  2. This is really rare, but I got dinged by this before -- unmodifiable is a reference to the underlaying list and can still race (read while something is writing and being inconsistent).
    I think you need to make a copy -- something like:
    new ArrayList(registeredFormats.values())