FLUME-92 Extractor to handle extracting of a date's components from an event field

Review Request #402 - Created July 28, 2010 and submitted

Nick Verbeck
Added support for a Date Extractor
Test Case created
  1. Some specific comments on the code but in general seems fine except for the semantics of append.
    Also add documentation to src/docs/UserGuide/Appendix for this extractor (along with regex/split)
  2. "date" is not a good name, too general.
    "datefieldex", "dateextractor" or "dateex" seem better. "dateparse"?
    maybe we should come up with some convention such as ex<foo> - exDate, exPostal, exTime, exIP, etc...?
    1. The exDate convention sounds like a good idea to me. Might help to make it a little easier to read complex flows. 
    2. Do you want to go ahead and make this the standard, and I'll put as part of this a change over for regex to exRegex, and split to exSplit. With backwards support as well. Just change the docs and put a copy of said defs in the SinkFactoryImpl. 
    3. I'd prefer keeping the existing names for split and regex for now -- we can revisit the naming when we get closer to a major release version.
      It feels like split and regex are more generic than what/how I think the exDate, exTime, exIP examples would work.
  3. naming convention
  4. fields protected at least?
  5. minor nits, but spcs btw f( and ){
  6. I don't think you want to throw an exception here - it will stop the driver from pushing to the sink (notice regex/split don't do this). If the event doesn't contain attr I think you just want to return (ie is that really an error condition?)
  7. indentation problems on these lines
  8. similar to previous comment on attr does not exists - not really an ioexception, maybe a warning? debug?
  9. why not use attributes.setint?
    1. I did originally but in the processes of trying to work around a limitation I found with the Attributes class. What turned out to be a problem with _ in the field names. I switched it to String as a check to see if was the problem I was noticing. Bug is documented as FLUME-93. I can switch those back to make it more uniform or do we still wish to use Attributes as I've noticed a comment somewhere where you guys are wishing to remove usage of it.
    2. I think that's for reporting... Jon might have more insight/preference.
    3. Using strings for now is ok -- we may revisit this in the future.  Attributes.* is something I want to replace/deprecate but haven't been able to yet.  (also partially why this is 0.9.x instead of 1.0.x)
  10. bad usage string
    1. Do you guys not use [] to denote optional fields? or is something else wrong?
    2. thats fine - I should have been more explicit, you are documenting "date" (whatever it's called now), not "regex". ;-)
    3. Ah didn't even notice. Thats what I get for copy+paste.
  1. Cool! Docs and tests!
    I have a few questions about some places where docs/comments have undefined behavior.
  2. src/docs/UserGuide/Appendix (Diff revision 3)
    Should ideally have link to the docs/javadoc for the pattern syntax.
    maybe "prefix" instead of "prepender"
    1. I'll change it in the docs as well as in the code just to be consistent.
  3. does this create a new instance or reuse an instance?  Might have some thread-safety issues?
    1. It returns a new instance from the way the docs sound. http://download.oracle.com/javase/1.4.2/docs/api/java/util/Calendar.html#getInstance()
  4. Not sure this is the right behavior if there is no date string found.  I think it should pass the event through -- e.g. it should call super.append(event) and then return.
    I think it would make sense to have a separate "filter" decorator that could reject events that didn't have an attributes or if the attribute's value didn't match some value.
    1. That could be a useful filter, as well as thinking about it a regex based filter. That if the event body doesn't match the regex its out. 
  5. What happens if I have a body that has multiple dates? gets the first? gets the last?  (It is fine to document this as a limitation)
    What happens if the line doesn't have a date in it?
    Test cases to document this behavior please!
    1. It doesn't read the date from the Body. Take a look at line 91. It requires that you extract the date using split/regex/regexAll or some other method. Allowing you to parse as many dates as you want from the event. Thats what the 1st param is all about. 
  1. lgtm.
  1. Hi Nick,
    I'm trying to apply the patch, and would like to have it so that you get author credits.  My guess is that you generated the patch via 'git diff'.  I looked at your repo and see that there are several small patches on you FLUME-92 branch.  Could you rebase this, or consolidate all the small patches into one, and then use 'git format-patch' to generate a new patch?  
    I should be on IRC during the week if you want help!
  1. Nick,
    I''m kind of treating this as a new review since I haven't seen it for a while.  Most of the comments are just cosmetic nits or clarifications.  
  2. src/docs/UserGuide/Appendix (Diff revision 6)
    typo? "data string"?
  3. src/docs/UserGuide/Appendix (Diff revision 6)
    prepender?  maybe 'using "date" as a prefix if no _prefix_ is provided'?
  4. src/docs/UserGuide/Appendix (Diff revision 6)
    typo: "to extracted it"
  5. Wrong Logger, this has been updated.
    Actually, not used at all in this code..
  6. maybe some comment about behavior when date attribute not present.
  7. move to after the next check, right before line 77?
  8. do you think this common place or rare?
    maybe a debug log message?
  9. Make usage consistent with others:
    usage: exDate(field, pattern[, prefix="prefix"])
  10. src/javatest/com/cloudera/flume/core/extractors/TestExtractors.java (Diff revision 6)
    Why 'new String("##")'?
    don't we just need "##"?
  1. lgtm.
  1. Getting a test failure when I run this test.  Can you confirm that you get this problem as well?  
    My guess is that July is daylight time (EDT), and currently it is standard time (EST) so it takes into account the hour shift.  
    Can you confirm?  Maybe add another date in standard time?
    If this is the case please add comments about why it seems "off by one".
    1. I can't reproduce this. I end up with 3 successful test when I run this TestCases
          [junit] Running com.cloudera.flume.core.extractors.TestExtractors
          [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.344 sec
  2. This line returns "10" when I run the test!
  3. This line returns "10" when I run the test!
  4. This line returns "10" when I run the test!
  2. Should this be exDate?
  3. Should this be exDate?
  4. Whitespace at EOL here and in various other places.
    1. I do not see what is wrong here. Its a new line and an empty line. It should have whitespace to care the tabbing down.
  5. weather -> whether
  6. Why create this on each call to append? This can be per-thread or just make append synchronized and share a single instance of this SimpleDateFormat.
  7. Formatting not consistent with the rest of Flume.
    1. Please explain what is wrong.
    2. Espace spaces inside (), no space outside (). Should be: if (dateStr.length() == 0) {
      That's the same sort of thing for most of the formatting comments below as well.
  8. Formatting not consistent with the rest of Flume.
    1. Again Please explain what is wrong.
    2. Missing spaces around +
  9. This does the work once ... and then if padding is set, does it again which seems like wasted time.
    1. Its done twice because you cant guarantee Type for lines 117-122. Event with an else statement Eclipse seems to complain about it.
  10. Formatting not consistent with the rest of Flume.
    1. This doesn't help. What is wrong here? From looking at other code this looks consistent to me.
  11. Formatting not consistent with the rest of Flume. (no spaces around '+')
  12. This usage string seems to indicate that it is using keyword args ... but that doesn't seem to be what the code is doing?
    1. I agree but thats how I was asked to format it in a previews review.
  13. Formatting not consistent with the rest of Flume.
    1. Again Please explain what is wrong.
  14. Formatting not consistent with the rest of Flume.
    1. Again Please explain what is wrong.
Review request changed
  1. intersting.  seems like test passes now.  lgtm.
  1. Hey Nick, 
    I figured out why the code failed to pass test -- I commented on jira about it but this is likely a better place.  Since I marked it ship it (before I fully tested it) I'll fix it this time.  See comment for info on the code I added.
  2. Add these two lines to your code here, and it should be good to go.
    TimeZone tz = TimeZone.getTimeZone("America/Denver");