HUE-8747 [editor] Download query result as task

Review Request #13711 - Created March 14, 2019 and submitted

Jean-Francois Desjeans Gauthier
hue
master
HUE-8747
hue
jgauthier
commit 109e4eb4079b32dd8d63910cad9fec902d5bb1c8
Author: jdesjean <jgauthier@cloudera.com>
Date:   Thu Mar 14 13:58:40 2019 -0700

    HUE-8747 [editor] Download query result as task

:100644 100644 61e6185b15... 26d33c6da0... M	desktop/core/src/desktop/lib/export_csvxls.py
:000000 100644 0000000000... bb364e91a2... A	desktop/libs/notebook/src/notebook/tasks.py
:100644 100644 aea629f4d6... 5a58952796... M	desktop/libs/notebook/src/notebook/views.py

Pushed to HUE-CELERY

  1. 
      
  2. desktop/libs/notebook/src/notebook/views.py (Diff revision 2)

    Move this code to tasks.py

  3. 
      
  1. Quick nits just FYI

  2. desktop/libs/notebook/src/notebook/tasks.py (Diff revision 2)

    Not needed I believe as celery is set up via desktop conf now

  3. desktop/libs/notebook/src/notebook/tasks.py (Diff revision 2)

    Same, just need to import from https://github.com/cloudera/hue/blob/master/desktop/core/src/desktop/celery.py#L20

  4. 
      
  1. Really nice!

    +1 on the TODOs indeed. Just some high-level thoughts

  2. apps/beeswax/src/beeswax/data_export.py (Diff revision 6)

    Long term, maybe we could switch to pandas dataframes, which support many formats like to_parquet, to_...

    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

    Probably better than custom CSV.

    Parquet files have the headers types embedded. This would work for any DB. We could write to local storage, then remote object store or redis in the cache API.

    1. I agree it would be cleaner. My initial thought is that if the user requests a CSV download, we can just serve the file as-is right away. A custom format means we need to convert before serving, which means we need another task.
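
    A minimal sketch of the pandas approach under discussion (the helper name and paths are hypothetical, and Parquet output additionally requires pyarrow or fastparquet):

```python
import pandas as pd

def export_result(rows, columns, path, fmt='csv'):
    # Build a DataFrame from a fetched result set and write it out.
    # Parquet embeds the column types in the file; CSV does not.
    df = pd.DataFrame(rows, columns=columns)
    if fmt == 'parquet':
        df.to_parquet(path)  # needs pyarrow or fastparquet installed
    else:
        df.to_csv(path, index=False)
    return path
```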

  3. nit: wondering if there is a generic way to detect non-callables

  4. nit: postdict={}

  5. We can look at cleaning this up when we do the cache system.

  6. For small queries 1s might be too small for starting

    1. Did you mean large queries? My goal is to make fast queries faster in Hue. We could probably make this smarter by choosing a timeout based on the status (e.g. if in queue, increase the timeout)
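
    For illustration, the kind of adaptive wait being suggested could look like this (an illustrative helper, not the actual Hue code):

```python
import time

def poll(check_finished, start=0.1, cap=5.0, factor=2.0):
    # Exponential backoff: fast queries return almost immediately,
    # while long-running ones stop hammering the task broker.
    delay = start
    while not check_finished():
        time.sleep(delay)
        delay = min(delay * factor, cap)
```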

  7. Potential infinite loop here if unexpected issues?
    IIRC there is also a TTL for celery tasks

    1. What would be an appropriate amount of time to give up?
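
    One option, sketched under the assumption that the loop polls a Celery task: enforce an overall deadline client-side (Celery itself also has `task_time_limit` and `result_expires` as backstops):

```python
import time

def poll_with_deadline(check_finished, interval=0.5, deadline=3600.0):
    # Give up after `deadline` seconds instead of looping forever
    # when a task is lost or stuck.
    start = time.monotonic()
    while not check_finished():
        if time.monotonic() - start > deadline:
            raise TimeoutError('gave up waiting for task result')
        time.sleep(interval)
```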

  8. nit: no parentheses

  9. Typical Python pattern to avoid leaks:

    postdict={}

    -->

    postdict=None

    if postdict is None:
        postdict = {}

  10. In practice we can fetch right away after the finish status? (i.e. writing to the cache and reading from it at the same time does not conflict? No need to fetch from the real API in parallel too if the cache is not ready)

    1. This is not an issue, because we can return 0 results. Client will currently keep fetching until it gets results back.
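
    A sketch of that contract, assuming a Celery-style AsyncResult (names are illustrative, not the actual Hue code):

```python
def fetch_result(task):
    # If the task has not finished writing to the cache yet, return an
    # empty page with has_more=True; the client keeps polling until
    # rows appear, so no error is needed.
    if task.state != 'SUCCESS':
        return {'has_more': True, 'data': [], 'meta': [], 'type': 'table'}
    return task.result
```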

  11. Or even standard error as query not finished

    1. I'm not sure throwing an error is the right thing to do here. See comment above.

  12. progress_path is the number of rows already persisted?

    1. progress_path is the number of rows already fetched by client.

  13. desktop/libs/notebook/src/notebook/views.py (Diff revision 6)

  14. 
      
  1. Can't wait to try it!

  2. If the query errors do we exit?

  3. Add todo on truncation?

  4. desktop/libs/notebook/src/notebook/tasks.py (Diff revision 7)

    Have a top variable to avoid some of the duplication?

    result = {
        'has_more': False,
        'data': [],
        'meta': [],
        'type': 'table'
    }
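
    The suggestion sketched out (illustrative branching, not the actual Hue handler): build the skeleton once and let each branch fill in only what differs.

```python
def build_result(state, rows=None, meta=None):
    # Shared skeleton: every state starts from the same empty result.
    result = {
        'has_more': False,
        'data': [],
        'meta': [],
        'type': 'table',
    }
    if state == 'SUCCESS':
        result['data'] = rows or []
        result['meta'] = meta or []
    elif state == 'RUNNING':
        result['has_more'] = True
    return result
```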

  5. 
      
Review request changed

Status: Closed (submitted)
