FLUME-320 Race condition in new NIO based TailSource

Review Request #1433 - Created Jan. 9, 2011 and submitted

Alex Baranau
FLUME-320 Race condition in new NIO based TailSource.

This is initial version of patch and is "highly open" for discussion. This patch fixed issues I had on my setup, but I couldn't test it against script at FLUME-320 yet, because jira attachments were lost.

  1. Hi Alex,
    It really has a lot of changes. And in my opinion, it looks clearer and neater.
    Still, I try to give another try based on your work. :)
    The following snippet is some pseudo-codes about tailBody method.
    channelSize = in.size()
    madeProgress = readAllFromChannel()
    if madeProgress then
        lastFileLen = file.len()
        lastFileMod = file.lastModified()
        return true
    flen = file.len()
    fmod = file.lastModified()
    if flen == lastFileLen and fmod == lastFileMod then
        there is no change, return false
    if in.size() == channelSize then
    return false
  1. Another note: it makes sense to notice simple file rotation early and finish reading of old file in new (separate) thread while returning quickly to start reading from new file.
    I had this case: log4j outputting 10M (max) files and at some point due to some reason several files of 10M were filled. Flume's tails missed some of them since they were quickly rolled while processing was not as quick (I wrote data from file to HBase which has lower write performance than simple output of logs on fs).
    Btw, with first diff I submitted we can easily add this, while some more thinking required with the second diff.
    1. Hi Alex,
      Yes, you are right. The pseudo-codes still fail to tackle the race. Thanks for pointing out this.
      As for the note about noticing the file rotation early, I doubt if this is really helpful. It is hard to define "early". Probably 10ms is early enough for 10MB files, but for 10KB files, it might be too late.
      Even with help of inode number and the new WatchService APIs in JDK 7(it still depends on the underlying OS implementation), we probably cannot get 100% correct solution either.
      BTW, tailSourceDir might helps if the rotated files are in the same directory.
  1. I'm was search for a local copy of that script, but couldn't find.  will rewrite (need it for pre-release end-to-end testing testing as well) 
    1. If you guys are on trunk after FLUME-430 commit (hash dcb39e) there are some useful scripts for generating data in the end to end testing dir:
      So if you run flume with this command line (run a node called node, tailing file foo and writing to file bar in avrojson format)
      $ flume node -n node -1 -c 'node:tail("foo")|text("bar","avrojson");'
      In another shell 
      $ count-forever .010 > foo
      ^C (After a while)
      $ count-forever-verify < bar
      This should give a summary of which events have duplicates or have discontinuities.
    2. Ran test comparing old version with new version.  In initial test, new version seems much more solid than the old.  Sweet!  
      Will report more after letting run for a hour or so.
  1. Hey guys, 
    I buy it and I tested it enough to believe that this is better than the previous.  Please fix the nits and post patch for final review and on jira! 
  2. nit: reformat so that comment starts on next line.
  3. nit: remove comment
  4. nit: new comment doesn't quite make sense.
Review request changed

Change Summary:

Thanks for running tests, Jon. Had too busy weeks to test myself. Will do later in case they are still show race issues (though all further improvements would go in separate JIRA issue).

Fixed nits, updated diff.

Wanted to know Jon's opinion on reading file to the end in separate thread in case of rotation detection. Thanks!


Revision 3 (+120 -123)

Show changes