Solr Smart Indexer

Review Request #7790 - Created June 15, 2016 and updated

Aaron Peddle
hue
solr-smart-indexer
hue
romain
commit 4a5bb397d2002cfe6aa0edefa2c9be234e3c5df3
Author: Aaron Peddle <aaron.peddle@cloudera.com>
Date:   Wed Jun 15 14:59:40 2016 -0700

    smart indexer cleanup and refactor

:100644 100644 5250e41... a44a73f... M	desktop/libs/indexer/src/data/oozie_workspace/morphline_template.conf
:100644 100644 f0aa71e... fb1cf02... M	desktop/libs/indexer/src/indexer/smart_indexer.py
:100644 100644 5ef8ed4... 480ac6c... M	desktop/libs/indexer/src/indexer/tests_indexer.py

commit e4e49e8eea45f3c51e1fa34822caa4df651b3a2a
Author: Aaron Peddle <aaron.peddle@cloudera.com>
Date:   Wed Jun 15 12:43:27 2016 -0700

    smart_indexer refactor

:000000 100644 0000000... 5250e41... A	desktop/libs/indexer/src/data/oozie_workspace/morphline_template.conf
:000000 100644 0000000... 65b3ec0... A	desktop/libs/indexer/src/data/oozie_workspace/workflow.xml
:100644 100644 be4e3b8... a08f96f... M	desktop/libs/indexer/src/indexer/conf.py
:100644 100644 a56093a... f0aa71e... M	desktop/libs/indexer/src/indexer/smart_indexer.py
:100644 000000 b776392... 0000000... D	desktop/libs/indexer/src/indexer/templates/morphline_template.conf
:100644 000000 ffbb616... 0000000... D	desktop/libs/indexer/src/indexer/templates/schema.mako
:100644 100644 71e0daa... 5ef8ed4... M	desktop/libs/indexer/src/indexer/tests_indexer.py

commit b4c9ada2c0194d8beb798cd26b25a8fb364f160e
Author: Aaron Peddle <aaron.peddle@cloudera.com>
Date:   Wed Jun 15 13:22:55 2016 -0700

    solr smart indexer end to end

:100644 100644 4572726... 079343e... M	desktop/libs/indexer/src/data/solrconfigs/solrcloud/conf/solrconfig.xml
:100644 000000 114eb60... 0000000... D	desktop/libs/indexer/src/indexer/indexer.py
:100644 000000 b6739d3... 0000000... D	desktop/libs/indexer/src/indexer/simple.csv
:000000 100644 0000000... a56093a... A	desktop/libs/indexer/src/indexer/smart_indexer.py
:100644 100644 e46490d... b776392... M	desktop/libs/indexer/src/indexer/templates/morphline_template.conf
:000000 100644 0000000... 71e0daa... A	desktop/libs/indexer/src/indexer/tests_indexer.py
:100644 100644 ddf7219... 06f413b... M	desktop/libs/libsolr/src/libsolr/api.py

commit 451909e5d4dda0a1e0b1142b7a5a553cc826bd3a
Author: peddle <peddle.aaron@gmail.com>
Date:   Fri Jun 3 11:02:08 2016 -0700

    simple indexer api for basic csv case

:000000 100644 0000000... 114eb60... A	desktop/libs/indexer/src/indexer/indexer.py
:000000 100644 0000000... b6739d3... A	desktop/libs/indexer/src/indexer/simple.csv
:000000 100644 0000000... e46490d... A	desktop/libs/indexer/src/indexer/templates/morphline_template.conf
:000000 100644 0000000... ffbb616... A	desktop/libs/indexer/src/indexer/templates/schema.mako


Description From Last Updated
nice! later it might be more functions, so that we can call them or not Romain Rigaux
licence? Romain Rigaux
to parameterize both hue-aaron-1.vpc.cloudera.com and output e.g. ${nameNode}${output-dir} Romain Rigaux
same ${zk-host} then in the tests, you can get the value directly by index/conf.py zkensemble() Romain Rigaux
same Romain Rigaux
same Romain Rigaux
${file_path} ${file-path} ${filePath} standardize on one? filePath as we already have jobTracker? Romain Rigaux
INDEXING_TEMPLATES_PATH ? Romain Rigaux
from indexer.conf import CONFIG_OOZIE_WORKSPACE_PATH Romain Rigaux
init(username, fs): might simplify Romain Rigaux
in https://github.com/cloudera/hue/blob/master/desktop/libs/liboozie/src/liboozie/conf.py#L40 'import time' on top def get_remote_deployment_dir(username, job_id): return REMOTE_DEPLOYMENT_DIR.get().replace('$USER', username).replace('$TIME', str(time.time())).replace('$JOBID', str(job_id)) then just get_remote_deployment_dir(self.username, index_uuid) Romain Rigaux
space after each : ? Romain Rigaux
need to remove I think Romain Rigaux
This can be obtained with similar logic: https://github.com/cloudera/hue/blob/master/desktop/libs/liboozie/src/liboozie/submission2.py#L235 Actually, if you have the right mapping {collectionName, filePath, workspacePath}, this should ... Romain Rigaux
let's make this one cleaner later, just in the meantime, maybe /tmp/smart_indexer_lib and put the content of the lib in ... Romain Rigaux
workspacePath Romain Rigaux
will need to return the oozie job id for later Romain Rigaux
FileFormat.get_instance(data['file']).to_dict() Romain Rigaux
dict? Romain Rigaux
todo later: difference between string and text Romain Rigaux
ditto Romain Rigaux
Exception --> ValueError ? Romain Rigaux
move to top? Romain Rigaux
I think this might not be python 2.6 compatible? Romain Rigaux
why dict? Romain Rigaux
todo later, maybe HueLogsFormat (all predefined), HiveTableFormat.. Romain Rigaux
CSVType or CSVFormat ? Romain Rigaux
maybe add a *5 too ? Romain Rigaux
space after each : ? Romain Rigaux
bit complex? Romain Rigaux
enumerate? Romain Rigaux
sample_rows = itertools.islice(reader, NUM_SAMPLE) ? Romain Rigaux
enumerate? Romain Rigaux
here why not drop the local file and just create a file on HDFS with simpleCSVString content? Romain Rigaux
try to delete the collection if it exists before? Romain Rigaux
100 chars by line is fine Romain Rigaux
note about 'solr' user Romain Rigaux
nice, reuse solr type conversion, will see later how to update solr field types and attributes Romain Rigaux
https://github.com/cloudera/hue/blob/master/desktop/libs/notebook/src/notebook/tests.py#L41 About solr user --> just use real 'test' user Romain Rigaux
revert? Romain Rigaux
  2. desktop/libs/indexer/src/data/oozie_workspace/morphline_template.conf (Diff revision 1)

    nice!

    later it might be more functions, so that we can call them or not

  3. to parameterize both hue-aaron-1.vpc.cloudera.com and output

    e.g.

    ${nameNode}${output-dir}

  4. same

    ${zk-host}

    then in the tests, you can get the value directly by

    index/conf.py zkensemble()

  5. ${file_path}
    ${file-path}
    ${filePath}

    standardize on one?

    filePath as we already have jobTracker?
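
The naming question above could be checked mechanically. A minimal sketch, assuming the templates use `${name}` placeholders as quoted in the comments; the helper names `find_placeholders` and `non_camel_case`, the sample template text, and the camelCase rule itself are illustrative assumptions, not part of the patch:

```python
import re

# Matches ${...} placeholders; letters, digits, '_' and '-' are allowed so that
# all three spellings from the review (file_path, file-path, filePath) are found.
PLACEHOLDER_RE = re.compile(r'\$\{([A-Za-z0-9_-]+)\}')

# camelCase: starts with a lowercase letter and contains no '_' or '-'.
CAMEL_CASE_RE = re.compile(r'^[a-z][A-Za-z0-9]*$')


def find_placeholders(template_text):
    """Return the set of placeholder names used in the template."""
    return set(PLACEHOLDER_RE.findall(template_text))


def non_camel_case(names):
    """Return the sorted names that do not follow the camelCase convention."""
    return sorted(name for name in names if not CAMEL_CASE_RE.match(name))


# Hypothetical template fragment mixing the naming styles.
sample = 'path: "${nameNode}${output-dir}" file: "${file_path}" tracker: "${jobTracker}"'
names = find_placeholders(sample)
offenders = non_camel_case(names)
```

Something like this could run in the tests so that a new `${...}` name in a different style gets noticed early.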

  6. INDEXING_TEMPLATES_PATH ?

  7. from indexer.conf import CONFIG_OOZIE_WORKSPACE_PATH

  8. init(username, fs):

    might simplify

  9. in https://github.com/cloudera/hue/blob/master/desktop/libs/liboozie/src/liboozie/conf.py#L40

    'import time' on top

    def get_remote_deployment_dir(username, job_id):
      return REMOTE_DEPLOYMENT_DIR.get().replace('$USER', username).replace('$TIME', str(time.time())).replace('$JOBID', str(job_id))

    then just

    get_remote_deployment_dir(self.username, index_uuid)
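
For trying the suggestion outside Hue, a minimal standalone sketch follows. It assumes a plain string in place of the `REMOTE_DEPLOYMENT_DIR` config object; the template value and the optional `now` parameter (added only to make the result reproducible) are hypothetical:

```python
import time

# Assumption: a plain string stands in for Hue's REMOTE_DEPLOYMENT_DIR config
# value, which the real code reads with .get(). The path layout is made up.
REMOTE_DEPLOYMENT_DIR_TEMPLATE = '/user/$USER/deployments/$JOBID-$TIME'


def get_remote_deployment_dir(username, job_id, now=None):
    """Fill in the $USER, $TIME and $JOBID markers of the template."""
    now = time.time() if now is None else now
    return (REMOTE_DEPLOYMENT_DIR_TEMPLATE
            .replace('$USER', username)
            .replace('$TIME', str(now))
            .replace('$JOBID', str(job_id)))


# A fixed timestamp keeps the example deterministic.
path = get_remote_deployment_dir('test', 'abc123', now=1466000000.0)
```

With the helper living in one place, the indexer would only need to pass its username and index UUID.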

  10. space after each : ?

  11. need to remove I think

  12. desktop/libs/indexer/src/indexer/smart_indexer.py (Diff revision 1)

    This can be obtained with similar logic: https://github.com/cloudera/hue/blob/master/desktop/libs/liboozie/src/liboozie/submission2.py#L235

    Actually, if you have the right mapping {collectionName, filePath, workspacePath}, this should even just work https://github.com/cloudera/hue/blob/master/apps/oozie/src/oozie/views/dashboard.py#L892

  13. let's make this one cleaner later, just in the meantime, maybe

    /tmp/smart_indexer_lib

    and put the content of the lib in dropbox + README about which packages to install and which jars to grab

    ?

  14. will need to return the oozie job id for later

  15. todo later: difference between string and text

  16. Exception --> ValueError ?
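
A minimal sketch of why the narrower exception may be preferable, assuming the code in question rejects an unrecognized input value; the function name `get_field_type` and the type names are hypothetical and not taken from the patch:

```python
# Hypothetical set of supported type names, for illustration only.
SUPPORTED_TYPES = ('string', 'long', 'double')


def get_field_type(name):
    """Return the type name unchanged, or raise ValueError if unsupported."""
    if name not in SUPPORTED_TYPES:
        # ValueError signals "right kind of argument, bad value", so callers
        # can catch it without also swallowing unrelated errors.
        raise ValueError('Unsupported field type: %s' % name)
    return name


try:
    get_field_type('blob')
    caught = None
except ValueError as e:
    caught = str(e)
```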

  17. I think this might not be python 2.6 compatible?

  18. todo later, maybe HueLogsFormat (all predefined), HiveTableFormat..

  19. CSVType or CSVFormat ?

  20. maybe add a *5 too ?

  21. desktop/libs/indexer/src/indexer/smart_indexer.py (Diff revision 1)

    space after each : ?

  22. desktop/libs/indexer/src/indexer/smart_indexer.py (Diff revision 1)

    sample_rows = itertools.islice(reader, NUM_SAMPLE)

    ?
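
The `islice` suggestion can be sketched as follows. This assumes a `csv.reader` over the input and a header row that is consumed first; the value of `NUM_SAMPLE` and the in-memory CSV are placeholders, not the values used in the patch:

```python
import csv
import io
import itertools

NUM_SAMPLE = 3  # assumed sample size; the real constant lives in the patch

# Hypothetical in-memory CSV standing in for the file being indexed.
data = io.StringIO(u"id,name\n1,a\n2,b\n3,c\n4,d\n")
reader = csv.reader(data)

# Assumption: the first row is a header and is read separately.
header = next(reader)

# islice stops after NUM_SAMPLE rows without reading the rest of the file,
# and needs no manual counter or break.
sample_rows = list(itertools.islice(reader, NUM_SAMPLE))
```

Because `islice` is lazy, the reader is left positioned at the first unsampled row, so the rest of the file can still be consumed afterwards.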

  23. desktop/libs/indexer/src/indexer/tests_indexer.py (Diff revision 1)

    here why not drop the local file and just create a file on HDFS with simpleCSVString content?

  24. try to delete the collection if it exists before?

  25. 100 chars by line is fine

  26. note about 'solr' user

  27. nice, reuse solr type conversion, will see later how to update solr field types and attributes

  2. FileFormat.get_instance(data['file']).to_dict()
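
One possible shape for that call chain, as a minimal sketch. The class body below is a hypothetical stand-in: the attribute names, the defaults, and the assumption that `data['file']` is a plain dict are illustrative and not taken from the patch:

```python
class FileFormat(object):
    """Hypothetical stand-in for the format class discussed in the review."""

    def __init__(self, delimiter=',', has_header=True):
        self.delimiter = delimiter
        self.has_header = has_header

    @classmethod
    def get_instance(cls, file_info):
        # Build an instance from a plain dict such as data['file'].
        return cls(
            delimiter=file_info.get('delimiter', ','),
            has_header=file_info.get('has_header', True),
        )

    def to_dict(self):
        # Serialize back to a JSON-friendly dict.
        return {'delimiter': self.delimiter, 'has_header': self.has_header}


data = {'file': {'delimiter': '\t'}}
result = FileFormat.get_instance(data['file']).to_dict()
```

Keeping construction in a classmethod and serialization in `to_dict()` would let the caller stay a one-liner, as in the comment above.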

  3. https://github.com/cloudera/hue/blob/master/desktop/libs/notebook/src/notebook/tests.py#L41 About solr user --> just use real 'test' user
