FLUME-685: Flume requires restart after HBase goes down

Review Request #1847 - Created June 27, 2011 and updated

Alex Baranau
* added retry after put fails with IOException.
* did small refactoring along the way (related to the changes) extracted base class for hbase sinks with common functionality: open/close sink, writing with retries, etc.

  2. Instead of hard-coding this value, I think the Abstract sink should have a default RETRY_INTERVAL_ON_WRITE_FAIL  and an overloaded constructor for this parameter.
  3. Does table.put() blocks for connection ?
    If not, we need to set dataWritten = false here.
  4. We could end up passing uninitialized writebuffer and writetoWAL values ans super class does not support default values
  1. I like the refector.
    I'm not sure if we need the retry mechanism in here.  We've tried to make the retry stuff part of flume -- so that if a sink throws an exception, the system can handle retry policies.  In this case, we probably could use the retry decorators (or wrap the hbase/attr2hbase sinks in the collector() {} wrapper sink) to handle retries available in the 0.9.3+ versions of flume.
  2. does this actually connect to hbase or is opening lazy and happen in the put call?
    (I don't see it in the hbase javadocs)
    1. as far as I remember I usually notice zk and other connection related logs as soon as the HTable is created. So I guess no lazy behaviour here.
  3. Have you tried wrapping this in a collector? 
    So instead of just having the sink be 
    hbase (blah blah)
    it would be:
    collector(500000) { hbase( blah blah) }?
    The collector piece handles retries and such if the IOExceptions are torwn when the append call is made.
    1. haven't tried, but I will. Thanks for a pointer!
  4. we might be able to add more range checking on writeBuffer, and maybe precondition checks on some of the other nullable args.
  1. So, there's another way for fixing temp connectivity issues affect (sink needs to be restarted). So options are:
    * use retry (collector) decorator -- needs to be tested
    * adjust the hbase settings (in xml) to make it retry "harder"
    * add code for retrying in hbase sink
    While the first two options look like the better approaches and more natural, the third one sets default behavior of the sink to what is expected by the user. My thought is that iff it is ok to ask user to always (well, if it is needed) use extra steps/configs while using hbase sink since default behavior isn't that good, then we should go with either of the first two options (i.e. just add it to a guide/javadocs).
    Anyways, I'd leave & commit the refactor of the HBase sink classes (i.e. extracting base one). May be to distinguish the refactoring from this retrying issue it makes sense to create separate issue with patch just for refactoring?
    1. Let's do two things -- let's make the refactor one patch and the the retry logic the other.  I'm starting to come around to the idea of putting more smarts into each sink.  I'll commit both.  (Having them separate makes it easier to choose later).
  2. Right, I can use return and while(true) instead if it is more readable.
    1. its a nit but if you don't mind that would be great.
  3. table.put() should be executed once. As soon as dataWritten = true the cycle exits. So it is always false in catch() block. Anyhow, this Q will go away as soon as I change the code to "return" from inside the cycle.
    1. good point.  I think was seeing an issue that wasn't an issue.
  1. As per discussion, created separate patch with only refactoring part: https://review.cloudera.org/r/1900/
Review request changed

Change Summary:

Updated diff based on https://review.cloudera.org/r/1900/. Fixed nit mentioned before


Revision 2 (+14 -2)

Show changes

  2. Was curious what happens when interrupted, saw that hbase client converts into IOException.  (HConnectionManager.processBatchOfPuts in HBase 0.90.4) 
    This is actually troublesome because we'll never be able to cleanup / reconfigure this logical node if the put gets into an error state! 
    Unfortunately, this might require changing HBase.  
    Also, using Flume's native recovery would suffer from the same problem.
    Can you add this caveat to the docs?  (Also can you test to verify this?)
  2. A somewhat related bit: The more you want to get away with exceptions, the more you move closer to Asynchronous HBase Client. It is used in Flume-OpenTSDB sink too. Its benefits are here: http://groups.google.com/group/opentsdb/browse_thread/thread/c952c5b2a057ae3a?pli=1