HUE-6160 [sentry] Support Sentry HA with failover

Review Request #10044 - Created April 5, 2017 and submitted

Submitter: Jenny Kim
Repository: hue
Branch: master
Bug: HUE-6160
Change: 81e6db4...
Group: hue
Reviewers: enricoberti, johan, krish, ranade, romain, weixia
commit a5113379355a22d1ec706ef8a11e8fe71cdb9dd1
Author: Jenny Kim <jennykim@cloudera.com>
Date:   Wed Apr 5 18:44:10 2017 -0700

    HUE-6160 [sentry] Support Sentry HA with failover
    
    Attempts to use the new sentry.service.client.server.rpc_address property to obtain the list of hosts, but will fall back to ZooKeeper properties if needed.

:100644 100644 6b11138349... 6500146b52... M	desktop/libs/libsentry/src/libsentry/api.py
:100644 100644 7512365a7a... ad9a011057... M	desktop/libs/libsentry/src/libsentry/api2.py
:100644 100644 23405544ea... 212bd8740f... M	desktop/libs/libsentry/src/libsentry/sentry_site.py
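The fallback the commit message describes might look like the sketch below; the rpc_address property name comes from the message, while the helper function and the ZooKeeper discovery call are hypothetical:

    # Sketch: prefer the new rpc_address property, and fall back to the
    # ZooKeeper-based discovery properties when it is absent.
    def resolve_sentry_hosts(sentry_conf):
        addresses = sentry_conf.get('sentry.service.client.server.rpc_address')
        if addresses:
            return addresses.split(',')
        return discover_hosts_from_zookeeper(sentry_conf)  # hypothetical fallback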

Testing Done: Manual (unit & integration tests forthcoming)

Open issues (summary):
  "Might need to be moved to its own function, cf comment in test below" (Romain Rigaux)
  "I am not sure here, why don't we use 'hostname' from [libsentry] and actually test without HA? Is _CONF_SENTRY_SERVER_RPC_ADDRESSES only ..." (Romain Rigaux)
  2. Could we remove the old Sentry way?

    (it was never finished on the Sentry side, so won't happen)

  3. desktop/libs/libsentry/src/libsentry/sentry_site.py (Diff revision 1)

    +1 to delete all old references

  1. Nit about opportunity to simplify?

  2. desktop/libs/libsentry/src/libsentry/api2.py (Diff revision 2)
    Seems like we could trash all of this now?

    (no need to cache, etc.)

    (I don't remember the story, but it seems like a big old copy-paste because of the cache, so maybe do a _get_client(username, client_class)?)
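    A minimal sketch of the suggested consolidation, assuming a hypothetical cache name and client constructor, since the duplicated api.py/api2.py code is not reproduced in this thread:

        # Hypothetical shared helper: one cache keyed by user and client
        # class, replacing the copy-pasted caching in api.py and api2.py.
        _API_CACHE = {}

        def _get_client(username, client_class):
            key = (username, client_class.__name__)
            if key not in _API_CACHE:
                _API_CACHE[key] = client_class(username=username)
            return _API_CACHE[key]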

  3. If empty, it raises an exception above.

  1. +1 to add some live tests too, e.g. reset the clients and check that it works and that we cache the good clients, what happens if only 2 of 3 hosts are good, etc.

    set_for_testing(host1, badHost2, badHost3)
    set_for_testing(badHost2, host1, badHost3)
    set_for_testing(badHost2, badHost3, host1)

    1. I think the "live" tests would best be handled in systest because it has the mechanisms to fault-inject into the Sentry servers. However, we can write unit tests in Hue at least, which use MockClients for the Sentry clients. Does that work?
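      A sketch of such a mock-based unit test; the patch targets (get_sentry_servers, _create_client) and the client attributes are assumptions, not verified against libsentry:

        from unittest import mock
        from libsentry.api2 import get_api

        def test_failover_skips_bad_hosts():
            # Hypothetical setup: the client constructor raises for the
            # two bad hosts, so get_api() should fail over to 'goodhost'.
            def fake_client(user, host, port):
                if host != 'goodhost':
                    raise Exception('connection refused')
                return mock.Mock(host=host)

            hosts = [('badhost1', 8038), ('badhost2', 8038), ('goodhost', 8038)]
            with mock.patch('libsentry.sentry_site.get_sentry_servers',
                            return_value=hosts), \
                 mock.patch('libsentry.api2._create_client',
                            side_effect=fake_client):
                api = get_api(mock.Mock(username='test'))
                assert api.host == 'goodhost'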

  2. desktop/libs/libsentry/src/libsentry/sentry_site.py (Diff revision 4)

    Might need to be moved to its own function, cf comment in test below

  3. if is_ha_enabled(): ?

    1. Refactored a bit to first check rpc_addresses, then check hostname/port, and raise if neither is configured.

  4. I am not sure here, why don't we use 'hostname' from [libsentry] and actually test without HA?

    Is _CONF_SENTRY_SERVER_RPC_ADDRESSES only for HA? We still rely on the 1st one as primary.

    1. Hmm, good question; let me check with Alex, maybe rpc_addresses will always be there with or without HA.

    2. Alex says that rpc_addresses will either contain a single server for non-HA, or a list for HA. So we should always check whether rpc_addresses contains anything, otherwise fall back to hostname/port in hue.ini, and if neither is present, raise an exception.
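      A minimal sketch of that resolution order; the accessor names below are placeholders for whatever sentry_site.py actually defines:

        # Agreed resolution order (all names here are assumptions):
        # 1. rpc_addresses from sentry-site.xml (one host, or a list for HA)
        # 2. [libsentry] hostname/port from hue.ini
        # 3. otherwise, raise
        def get_sentry_servers():
            addresses = get_sentry_server_rpc_addresses()  # hypothetical accessor
            if addresses:
                return addresses
            if HOSTNAME.get() and PORT.get():
                return ['%s:%s' % (HOSTNAME.get(), PORT.get())]
            raise PopupException('No Sentry servers are configured.')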

  2. Is the algo not too simplistic? We retry N times and pick a random host each time.
    We could instead take the list minus the current host, randomize it, pop the first one, retry N times, and if it's not good, pop the next one from the list?
    How long is the failover supposed to take? (Are all the Sentry HA instances non-standby?) If yes, we don't need the retry counter anymore.

    1. This is the algorithm that Alex from Sentry recommended we use, actually, because this list isn't really like YARN's, where the first entry is the primary and the second is the failover; it is more like a load-balancer list, where all of the instances are supposed to be available and we should randomly pick one to spread the load. They also intend to have only 2 available by default for HA setups, in which case, if we ignore the current one and only pick from the second one, we can never return to the first one even if it comes back online.

    2. If the first one comes back online, it will be picked up in the next round of failover. Here I am talking about what happens within one specific failover round: if Sentry Server #1 is down, it seems improbable that it would work again in the same round. Could this be clarified?

      What is the UI/UX in general when a failover happens? The Ajax calls should be blocked until the failover logic is done.
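    A minimal sketch of the random-pick retry loop under discussion; MAX_RETRIES, get_sentry_servers and _create_client are assumed names:

        import random

        MAX_RETRIES = 3  # assumed retry budget

        def get_api(user):
            last_error = None
            for _ in range(MAX_RETRIES):
                # All HA instances are active (no standby), so pick one at
                # random to spread load instead of treating #1 as primary.
                host, port = random.choice(get_sentry_servers())
                try:
                    return _create_client(user, host, port)
                except Exception as e:
                    last_error = e  # this host looks down; retry another one
            raise last_error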

  3. Exception: are we clear on which exception we would get in case of failover? One potential issue here is that we are too generic and start the retry process when facing an unrelated error.

    1. We can specifically catch a StructuredThriftException for the failover logic, and log and exit for all other exceptions?
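      A sketch of that narrowing, taking the StructuredThriftException name from the comment above; the helper itself is hypothetical:

        def _call_with_failover(fn, hosts):
            # Only transport-level failures trigger failover; any other
            # exception is re-raised so unrelated errors don't cause retries.
            for host in hosts:
                try:
                    return fn(host)
                except StructuredThriftException:
                    LOG.warning('Sentry host %s unavailable, trying the next one' % host)
                except Exception:
                    LOG.exception('Non-failover error from Sentry')
                    raise
            raise Exception('All configured Sentry hosts are unavailable.')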

  4. Do we at least cache the host in case of failover? (to not keep retrying on the bad host) Or is the client cached in the thrift pool?

    1. I thought you asked us to take out the API_CACHE in a previous review, but we can re-add it. It isn't cached in the thrift pool.
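      A minimal sketch of re-adding that cache; API_CACHE is the name used in the thread, the rest is assumed:

        API_CACHE = {}

        def get_cached_api(user):
            # Pay the failover cost once per user rather than on every
            # request; a connection error should evict the cached entry.
            if user.username not in API_CACHE:
                API_CACHE[user.username] = get_api(user)
            return API_CACHE[user.username]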

  5. Test cases: we should test when it is not supposed to work (see the sketch after this list):
    no Sentry servers
    all Sentry servers bad
    test with from libsentry.api2 import get_api
    test when a random exception is sent by the API (cf. above on detection of failover)
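    A sketch of the first two negative cases; the patch target and the expected exception type are assumptions:

        from unittest import mock
        from nose.tools import assert_raises
        from libsentry.api2 import get_api

        def test_no_sentry_servers():
            # Hypothetical: an empty host list should raise instead of
            # retrying forever.
            with mock.patch('libsentry.sentry_site.get_sentry_servers',
                            return_value=[]):
                assert_raises(Exception, get_api, mock.Mock(username='test'))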

  6. Test [] might fail later if we have roles?

  1. Algo seems weak; +1 if we get an official sign-off from the Sentry lead on randomly sorting the list and trying each one.
    How is the UX in the UI when a failover occurs?
    There is a bunch of code duplication until we support only v2.

    1. The UI blocks and displays a spinning wheel until a successful connection is made, or a popup exception is shown when it runs out of retries.

Review request changed

Status: Closed (submitted)
