[core] Stream CSV download and limit XLS download

Review Request #4325 — Created April 23, 2014 and submitted

Submitter: abec
Repository: old-hue-rw
Groups: hue
Reviewers: enricoberti, romain
commit bdab9f925caa22b0acb7ee8dcd19281dc15b9cdc
Author: Abraham Elmahrek <abraham@elmahrek.com>
Date:   Wed Apr 23 14:54:53 2014 +0300

    [core] Stream CSV download and limit XLS download
    
    CSV download can be streamed because that format can be concatenated easily.
    Tablib XLS has a max size of 65000 rows, so it must be limited.
    Also added a "starting yield" so that downloads start early rather than late.

:100644 100644 b76ecc3... c3517d5... M	apps/beeswax/src/beeswax/data_export.py
:100644 100644 13e7b5c... 94f3b13... M	apps/search/src/search/data_export.py
:100644 100644 c47ccd5... 00988f2... M	desktop/core/src/desktop/lib/export_csvxls.py
A couple of export tests are failing.

I'd like to make sure this idea is sound and accepted before fixing the tests.
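To make the streaming idea concrete: CSV chunks concatenate cleanly, so the response body can be a generator with an empty first yield, letting the download start before the first fetch returns; XLS has to be materialized in full by tablib, so its rows get capped. Below is a minimal sketch of that shape, not the patch itself: fetch_rows() is a hypothetical batch iterator, and StreamingHttpResponse assumes a Django version that provides it (Hue's actual response plumbing may differ).

    # Minimal sketch, Python 2 as Hue was at the time. fetch_rows() is a
    # hypothetical iterator of row batches standing in for the real fetch.
    import csv
    import StringIO

    from django.http import StreamingHttpResponse

    MAX_XLS_ROWS = 60000  # tablib's XLS writer cannot go much past 65000 rows


    def csv_chunks(fetch_rows):
      yield ''  # the "starting yield": the download starts right away
      for batch in fetch_rows():
        buf = StringIO.StringIO()
        csv.writer(buf).writerows(batch)
        yield buf.getvalue()  # CSV chunks can simply be concatenated


    def csv_response(fetch_rows):
      response = StreamingHttpResponse(csv_chunks(fetch_rows), content_type='application/csv')
      response['Content-Disposition'] = 'attachment; filename="query_result.csv"'
      return response


    def xls_rows(fetch_rows):
      # XLS cannot be streamed: tablib materializes the whole workbook,
      # so truncate at MAX_XLS_ROWS instead of blowing up mid-download.
      rows = []
      for batch in fetch_rows():
        rows.extend(batch)
        if len(rows) >= MAX_XLS_ROWS:
          return rows[:MAX_XLS_ROWS]
      return rows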
romain
  1. Yes, that seems better than before (if it does not break stuff :).
    
    My main worry is that we should either fail earlier for the download, or find a way to stream the XLS, as it will clog Hue. Is there no way to change tablib, or to limit the XLS size even more, so that it is limited but stable?
  2. We could fail, but I think it is good to return truncated data for now.
    1. Raising an exception makes sense actually. You can limit queries. The problem is that exceptions will be raised later in the life cycle of the request and cannot be captured.
  3. And I looked: there is no way to get the row count of a dataset.
    
    Maybe we should do a special call that emulates a count() on a data set (e.g. fetch 1000 rows, don't deserialize them, keep counting, then start_over; see the sketch below).
    
    Or maybe we should go less than 60k.
    1. Let's just go less than 60k. Complex queries will inherently be difficult to count. 
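The count() emulation floated in item 3 never lands in this patch; as a sketch of the idea only, against a purely hypothetical HS2-like client (the names and call signatures are invented for illustration):

    # Sketch only: `client` and its fetch()/start_over semantics are
    # hypothetical stand-ins; the point is counting rows without deserializing.
    COUNT_BATCH = 1000


    def emulated_count(client, handle, max_rows):
      """Fetch raw batches, count rows, then rewind with start_over."""
      total = 0
      while total < max_rows:
        batch = client.fetch(handle, rows=COUNT_BATCH)  # rows stay opaque, just count
        total += len(batch.rows)
        if not batch.has_more:
          break
      client.fetch(handle, rows=0, start_over=True)  # rewind before the real export
      return total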
abec
abec
romain
  1. This is great!
    
    My main worry is FETCH_SIZE: it will blow up HS2 or Hue if it is too large. Other improvements could be built on top of this patch (but they are high priority).
  2. apps/beeswax/src/beeswax/data_export.py (Diff revision 3)
    
    This won't work with long rows; it will just blow up in memory.
    
    Did you test that? I could repro it when I picked 1000.
    1. I didn't test this. I looked at the original implementation and followed along. I think fetching fewer times may be easier for Python memory management. We could also add a config variable?
    2. Let's put 1000 for now
  3. +1
    
    Shouldn't we compute a number of cells instead? If not, 2 cols with 60k rows won't be downloaded even though it should work. MAX_XLS_CELLS = 60000 * 20 (or we need a good estimate of memory per row, probably tricky; see the sketch below).
    
    + a config param maybe at some point, so we can lower it if needed without a code change
    1. Eh, if we count the number of cells, we should count the content as well. Let's keep it simple and just count rows?
    2. "Other improvements could be built on top of this patch (but they are high priority)": we can follow up after.
  4. If this improves it, we should probably add it in the view API that fetches rows.
    1. Indeed.
    2. Actually, this makes sense for the download APIs, but not the views. The views don't have the issue where the data is fetched after the traditional view has finished.
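Putting together the two outcomes of this review (FETCH_SIZE pinned at 1000 for now, and a cap counted in cells rather than rows), a hedged sketch of the truncation loop; the names are illustrative, not the patch's:

    # Sketch of the agreed knobs; fetch_batch() is a hypothetical callable
    # returning up to `limit` rows, or an empty list once drained.
    FETCH_SIZE = 1000            # "Let's put 1000 for now"
    MAX_XLS_CELLS = 60000 * 20   # so 2 cols x 60k rows still downloads


    def limited_rows(fetch_batch):
      """Yield rows until the cell budget is exhausted, then truncate."""
      cells = 0
      while True:
        batch = fetch_batch(FETCH_SIZE)
        if not batch:
          return
        for row in batch:
          cells += len(row)
          if cells > MAX_XLS_CELLS:
            return  # return truncated data rather than fail or clog memory
          yield row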
abec
Review request changed

Status: Closed (submitted)

bcwalrus
  1. Would make it configurable.
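On configurability, Hue's conf machinery would be the natural home for the cap. A sketch, assuming desktop.lib.conf.Config behaves here as it does elsewhere in the code base (the key name is invented for illustration):

    # Sketch: a hue.ini-tunable cap in the style of Hue's desktop.lib.conf
    # settings; `download_cell_limit` is an invented key, not the patch's.
    from desktop.lib.conf import Config

    DOWNLOAD_CELL_LIMIT = Config(
      key='download_cell_limit',
      default=60000 * 20,
      type=int,
      help='Maximum number of cells in a spreadsheet download before truncation.')

That way the limit can be lowered in hue.ini without a code change, which is the follow-up romain suggests above.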