FLUME-81 Support for 1 RegExp extractor that populates multi event attributes

Review Request #398 - Created July 28, 2010 and submitted

Nick Verbeck
Adds a new Extractor method to provide 1 RegExp that each sub pattern can be extracted and assigned to an attribute/field.
Internal runs against sampled Apache Log files in a stand alone flume setup.
  1. Hey Nick - this looks great, and functionality I've really wanted to have in Flume for some time. 
    There are a few nits below, and it would also be great to see tests before we think about committing it. Let me know if you have any questions!
  2. No need to split on every append(..) - maybe do this in the builder?
  3. Please use spaces, not tabs, and two spaces per indent.
  4. Why would this throw? I think m.group(..) should return null if nothing was captured for a particular group, right?
    1. m.group() does through an IndexOutOfBoundsException If there is no capturing group in the pattern with the given index.
  5. I think it's better to sanity check that n.length >= m.groupCount() at the start of the method.
    1. The idea behind why I did this is to allow users to give more/less names then they want from there regexp. Since there are occasions when a sub pattern is used but the data from it you may not wish to store.
  6. ... and if so I think you can lose this catch.
  7. You don't have to encode names as a string to split - since argv is variable length you can just iterate over it and pull out all the names. 
    The patch here: https://issues.cloudera.org/browse/FLUME-90
    has an example of doing exactly this in the builder. 
  1. Great addition, thanks!
    Some comments. Also, add documentation to src/docs/UserGuide/Appendix
  2. maybe 
    regexAll(regex, name [, name]*)
    to indicate multiple names allowed?
  3. try some corner cases?
  2. nit: I prefer: 
    class RegexAllExtractor<S extends EventSink> extends EventSink<S>
    1. Not sure on this since EventSinkDecorator already doing this just in a modified way that doesn't through errors about EventSink not being a generic.
      public class EventSinkDecorator<S extends EventSink> extends EventSink.Base
      As well as it would break due to EventSinkDecorator is a class where EventSink is just an interface. As well as both RegexExtractor and SplitExtractor do as currently implemented by RegexAllExtractor.
    2. I'm ok with it either way right now.  
  3. name[, name] 
    should be 
    "attr"[, "attr"]
  1. otherwise, lgtm.
  1. There few spacing nits, but the bigger issue what happened to the docs updates?
    Get the docs in and then this looks good to me.
  2. src/java/com/cloudera/flume/core/extractors/RegexAllExtractor.java (Diff revision 5)
    ideally, fix spaces.
  3. these are probably tabs instead of spaces.
Review request changed
  1. lgtm.