FLUME-706: Flume nodes launch duplicate logical nodes

Review Request #1896 - Created Aug. 10, 2011 and submitted

Eric Sammer
old-flume
FLUME-706
flume
eerpini, jon, nerdynick
This patch should prevent the "double spawn" case where Flume creates a logical node and immediately attempts to reconfigure it. This is harmful because, if the node grabs resources (e.g. a port, in the case of the Thrift RPC source), the subsequent spawn can run into errors that prevent both instances from working properly (due to some bad error handling).

What I've done:
* Refactored the DirectDriver <--> PumperThread relationship to be unidirectional. PumperThread is now a static nested class rather than a non-static inner class, and any state it needs is passed to it explicitly. The driver no longer caches certain bits of state; it always fetches them directly from the PumperThread. This *should* be a net simplification, but it's still complicated (a rough sketch of this shape follows the list).
* LivenessManager used to perform two master RPCs from two different places: one to spawn newly discovered nodes and another to update a node's configuration. Due to some hinkiness, the second operation would always decide to reconfigure the newly spawned node (which was the root cause of the issue). This should no longer be the case. All spawn / config updates are now centralized and triggered asynchronously (which is not entirely different, but different enough to call out; a second sketch follows the list).
* LogicalNodeManager no longer attempts to micromanage LogicalNode state (i.e. it no longer invokes LogicalNode#loadNodeDriver()); this is now encapsulated by the LogicalNode itself.
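
To make the first point above concrete, here is a minimal sketch of what a unidirectional driver-to-pumper relationship can look like: the pumper is a static nested thread class whose state is handed in explicitly and read back through accessors. The class and field names are illustrative placeholders, not the actual DirectDriver/PumperThread code:

```java
import java.util.concurrent.atomic.AtomicLong;

public class DriverSketch {

  /**
   * Static nested class: it cannot reach back into the enclosing driver, so
   * everything it needs is handed to it explicitly and everything the driver
   * needs is read back through accessors.
   */
  static class PumperThread extends Thread {
    private volatile boolean stopping = false;           // written by the driver
    private final AtomicLong eventsPumped = new AtomicLong(0);

    PumperThread(String name) {
      super(name);
    }

    void requestStop() {
      stopping = true;
    }

    long getEventsPumped() {
      return eventsPumped.get();
    }

    @Override
    public void run() {
      while (!stopping && !Thread.currentThread().isInterrupted()) {
        // pull one event from the source and push it to the sink (elided)
        eventsPumped.incrementAndGet();
      }
    }
  }

  private final PumperThread pumper = new PumperThread("pumper-0");

  public void open() {
    pumper.start();
  }

  public void close() throws InterruptedException {
    pumper.requestStop();
    pumper.join();
  }

  /** The driver keeps no copy of the pumper's progress; it always asks for it. */
  public long getEventsPumped() {
    return pumper.getEventsPumped();
  }
}
```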
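
And a rough sketch of the centralized spawn/reconfigure idea from the second point: a single check-config thread makes both the "spawn" and "reconfigure" decisions, so they can no longer race each other across threads. MasterClient, NodeConfig, and the five-second period are hypothetical stand-ins, not Flume's actual API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CheckConfigSketch {

  /** Placeholder for a single RPC returning the master's view of every node. */
  interface MasterClient {
    Map<String, NodeConfig> getNodeConfigs();
  }

  /** Placeholder for a versioned source/sink pair (FlumeConfigData-like). */
  static class NodeConfig {
    final long version;
    final String source;
    final String sink;

    NodeConfig(long version, String source, String sink) {
      this.version = version;
      this.source = source;
      this.sink = sink;
    }
  }

  private final MasterClient master;
  private final Map<String, NodeConfig> applied = new ConcurrentHashMap<>();
  private final ScheduledExecutorService checker =
      Executors.newSingleThreadScheduledExecutor();

  CheckConfigSketch(MasterClient master) {
    this.master = master;
  }

  public void start() {
    // One thread decides both "spawn" and "reconfigure", asynchronously.
    checker.scheduleWithFixedDelay(this::checkConfig, 0, 5, TimeUnit.SECONDS);
  }

  private void checkConfig() {
    for (Map.Entry<String, NodeConfig> e : master.getNodeConfigs().entrySet()) {
      NodeConfig current = applied.get(e.getKey());
      if (current == null) {
        spawn(e.getKey(), e.getValue());            // newly discovered node
      } else if (e.getValue().version > current.version) {
        reconfigure(e.getKey(), e.getValue());      // strictly newer config only
      }
      applied.put(e.getKey(), e.getValue());
    }
  }

  private void spawn(String node, NodeConfig cfg) { /* create logical node */ }

  private void reconfigure(String node, NodeConfig cfg) { /* apply update */ }
}
```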

I'm not entirely sure this covers all corner cases. In fact, I'm pretty sure it doesn't. Given the criticality of this code, please be extra critical in review and local testing. This patch was part of an emergency fix for a specific user.
Ran the full flume-core unit test suite. Some failures in the form of "stuck" error states. Looking into it, but little time available.
  1. Overall I like the fact that everything gets moved into the checkconfigthread, and that encapsulation is a bit better.  
    
    I'm concerned about a few things:
    * We've serialized the double config problem now -- before it would happen twice in two different, conflicting threads, but now I believe it would happen twice within the same thread.
    * I don't quite get why the volatiles are present.
    * the modifications to the tests -- you've started threads and wait for heartbeats, whereas before we could interleave different parts of the heartbeat at will within the test cases.
  2. What happens when we reach the bound here?
  3. I think I prefer passing FlumeConfigData instead of String src / String snk here. This will update the version numbers properly from the master's clock.
  4. This seems like it does a double config now because the config times on this heartbeat are going to be newer (even though it is the same config).  This is safer than before because at least this time it is in the same thread.
  5. We probably need to make interruption the only way out -- catch runtime exceptions and throwables, report them, and then loop until interrupted (see the sketch after these comments).
  6. why volatile?
  7. this is used as a lock (does signal==lock?)
  8. It concerns me that the tests are now timing-dependent and rely on sleeps. The previous version was set up so that we could interleave different actions, or parts of actions, in a way that lets us test racy conditions. This is no longer the case.
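
For comment 5, one possible shape of such an interrupt-only loop, purely as an illustration (the method names are placeholders): the thread reports every runtime error or throwable and keeps looping, exiting only when it is interrupted.

```java
public class InterruptOnlyLoop implements Runnable {

  @Override
  public void run() {
    // Interruption is the only exit path; every other failure is reported
    // and the loop continues.
    while (!Thread.currentThread().isInterrupted()) {
      try {
        pumpOnce();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();   // restore the flag so the
                                              // while condition ends the loop
      } catch (Throwable t) {
        report(t);                            // log and keep going
      }
    }
  }

  private void pumpOnce() throws InterruptedException {
    // one unit of work (elided)
  }

  private void report(Throwable t) {
    t.printStackTrace();
  }
}
```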