FLUME-286: DFO mode does not detect network failure

Review Request #1162 - Created Nov. 3, 2010 and submitted

Jonathan Hsieh
old-flume
flume
esammer, phunt
Previously, the Thrift RPC clients defaulted to an infinite timeout (a value of 0). This change makes the timeout configurable, with a default of 10 seconds. (A 5-second timeout would disconnect heartbeating nodes after every heartbeat!)
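To illustrate the behavior described above, here is a minimal sketch using plain java.net sockets rather than the actual Thrift client code in the patch. The class name, the property key `example.thrift.socket.timeout.ms`, and the helper methods are hypothetical and chosen only for this example. It shows a timeout read from configuration with a 10-second default, and how a non-zero socket timeout turns a silent peer into a detectable failure instead of an indefinite block.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.util.Properties;

public class RpcTimeoutSketch {
    // Hypothetical property key; the real patch may use a different name.
    static final String TIMEOUT_KEY = "example.thrift.socket.timeout.ms";
    // Default of 10 seconds, matching the default described in this review.
    static final int DEFAULT_TIMEOUT_MS = 10_000;

    /** Reads the timeout from configuration, falling back to the default. */
    static int resolveTimeoutMs(Properties conf) {
        String raw = conf.getProperty(TIMEOUT_KEY);
        return raw == null ? DEFAULT_TIMEOUT_MS : Integer.parseInt(raw.trim());
    }

    /**
     * Connects to a local server that never accepts or replies, standing in
     * for an unreachable peer. Returns true if the read gave up with a
     * timeout instead of blocking.
     */
    static boolean readTimesOut(int timeoutMs) throws IOException {
        try (ServerSocket silentServer = new ServerSocket(0);
             Socket client = new Socket()) {
            client.connect(
                new InetSocketAddress("127.0.0.1", silentServer.getLocalPort()),
                timeoutMs);
            // A value of 0 here would mean "wait forever", the old behavior.
            client.setSoTimeout(timeoutMs);
            try {
                client.getInputStream().read();
                return false; // peer answered or closed the connection
            } catch (SocketTimeoutException e) {
                return true;  // no data within timeoutMs: failure detected
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Properties conf = new Properties();
        System.out.println("default timeout ms: " + resolveTimeoutMs(conf));

        // Use a short timeout so the demonstration finishes quickly.
        conf.setProperty(TIMEOUT_KEY, "200");
        System.out.println("timed out: " + readTimesOut(resolveTimeoutMs(conf)));
    }
}
```

Keeping the default in a named constant such as `DEFAULT_TIMEOUT_MS` also makes it easy to find with a search like the one suggested in the review comments.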
Existing Thrift RPC related tests continue to work.

Manually tested that Thrift RPC for DFO seems to recover properly and that Thrift heartbeats recover properly.

Details: two physical machines, one with the master and collector node, one with the agent node. Start the agent sending data (console via agentDFOSink), then disconnect the Ethernet cable on the master. Notice that the agent is still available after the timeout and writes data to disk. Reconnect the cable and notice that, after the retry timeout, the DFO disk logs get sent to the collector.

I think I could automate this test, but it would require Linux and root access to use a firewall to simulate network partition failures.
  1. lgtm, seems like this should be documented though, no? We should encourage ppl to include doc updates in their patches.
    1. added to flume-conf.xml.
  2. is this documented somewhere?
    1. Will add a JIRA to do this.
  3. documented?
    1. Now in flume-conf.xml. The larger task of pulling all the config values back into documentation already has a JIRA, I believe, and is in my mind easier done in one swath.
  4. document default, should these be constants to enable easy grepping? (future?) "grep DEFAULT_ *.java" (something like that)