FLUME-410 Add ability to tail subdirs to tailDir sink

Review Request #1313 — Created Dec. 17, 2010 and submitted

Add additional parameter to tailDir source to make it recurse into subdirectories for watching/tailing files. Ideally add parameter which defines max depth of recursing.
test performed for tailDir with param recurseDepth=2, in test:
* checked tailing existing subdirs
* adding subdirs checked
* checked max recurseDepth
  1. Alex,
    Nice feature.  I have a few nits, some doc improvements, some questions about unused code (fireDirCreated/Deleted).  
  2. src/docs/UserGuide/Appendix (Diff revision 2)
    Wording is unclear.  Do you mean unlimited depth?  
    I think this should be some limited (but potentially large number).  What happens if we get into some kind endless of link recursion?
    Also, doc should make it clear that the regex is for file names (not including dirs), that the regex has no affect on the dirs, and that ALL directories are recursed into.
  3. style nit: prefer return here so a reader doesn't have to search for the return on the other side of the else.
    Also, if done there is no need for else context -- we've already ruled out that particular case.
  4. grammer. "about deleted files the handlers," ?
  5. This is the case where the dir is no longer present right?  (comment at this level would be nice)
  6. src/java/com/cloudera/util/dirwatcher/DirWatcher.java (Diff revision 2)
    I need to think about this more.
  7. src/java/com/cloudera/util/dirwatcher/DirWatcher.java (Diff revision 2)
    Confused -- are these used anywhere?
  8. would be nice to have a phases where you re-add a previously deleted dir and show that the new stuff gets pulled in.
  1. Thank you for taking look at this!
  2. src/docs/UserGuide/Appendix (Diff revision 2)
    Yep, I meant unlimited. 
    I'd suggest to ask for positive number from user. I'll add validation.
    Will adjust the rest of the text.
    1. sorry -- I was just pointing out the part of the sentence that didn't make sense me.  I think it still should be reworded.  I think you mean "this check will notify the handler about deletec files, so..."
  3. src/java/com/cloudera/util/dirwatcher/DirWatcher.java (Diff revision 2)
    Perhaps we should add extra param for filtering dirs on regex?
    Or, we can filter file (incl. dir) names based on full path. E.g. user could use "subdir[12](/.*.log)?" to tail subdir1/*.log and subdir2/*.log
    1. I'd make that would be an add on feature (new jira).  For now I think what exists is good enough.
  4. src/java/com/cloudera/util/dirwatcher/DirWatcher.java (Diff revision 2)
    sorry, this should be removed
  1. Alex,
    Doc update triggered another thought, input validation update on recurseDepth.  
    Adding a dir attribute thing sounds useful.  Create a new jira and make a new patch?  (this one is soo close!)  My suggestion is to use a kwarg to make recording the dir a open, and to give the user option to pick attribute name.
    So something like this:
    tailDir("/dir",".*", 0, dirAttr="tailfiledir", fileAttr="tailfilename")
    So a file /dir/file would produce events with metadata like:
    event { taildirfile="/dir", tailfilename="file" }
    1. added: https://issues.cloudera.org//browse/flume-431
  2. The updated docs made me notice this:
    change to
    if (recurseDepth <= 0)
  3. still doesn't quite make sense.
  4. or, make sure the value is >=0 (actually, doing both this and the first nit would be best)
Review request changed
  1. looks good to me!