FLUME-358: Real-time data source from a hosted-hub that uses pubsubhubbub

Review Request #1260 - Created Nov. 29, 2010 and updated

Dani Abel Rayan
This implements a Flume source for Superfeedr, a hosted hub implementation.
http://superfeedr.com/ is used by Gowalla, Digg, FriendFeed, and even Google, although it is not a generic hub. Several "sinks" have already been contributed to it; for instance, CouchDB: https://github.com/superfeedr/couchpubtato

They give everyone 25,000 free credits, against which they will push data. If we contribute some tools to them, like the ones at https://github.com/superfeedr/, they have a Hackr plan under which they will push real-time data for free.
I am not sure whether the vision of Flume is to extend to real-time analysis of all kinds of data or just logs - let me know.

Why Flume? Since there are many kinds of hubs and hubs push data, hubs become central hotspots, so Flume can be used to aggregate data from "different" hubs.
We can have local hubs pushing data internal to an organization. Any app supporting "webhooks" can ping a hub; see http://elasticdata.wordpress.com/2009/06/02/business-glue-webhooks-for-the-enterprise/
This will help collaboration. I will send a document with more details if you folks are interested. I am also working on a local hub from which we can consume real-time data; I have some courses next term in which I am planning to use Flume this way.
Yes. Demo ready :)
  2. plugins/superfeedrSource/build.xml (Diff revision 2)
    Should this be licensed to ASF or Cloudera?
    1. Yes. It has to be for Cloudera. Changed to Cloudera.
  3. plugins/superfeedrSource/build.xml (Diff revision 2)
    Should depend on slf4j
  4. plugins/superfeedrSource/build.xml (Diff revision 2)
    The plugin shouldn't need hadoop, right? (or guava 2 lines up)
    1.  FlumeConfiguration.get() uses it.
  5. plugins/superfeedrSource/build.xml (Diff revision 2)
  6. This seems like it should be something else ...
    1. ~Done~ Had to change the dir structure. 
  7. If this is going into Flume, you should be using SLF4J.
    1. ~Done~, but let me know why.
  8. Should a class name start with a lowercase letter?
    1. Java newbie from the C world, pardon :)
  9. This and username can be final.
  10. It doesn't look like you're using this?
  11. This is outside of your control here ... do you really care enough that it should be a warning log?
    1. Let's get it checked into Flume; I will present it to the Superfeedr guys. Only they will know if this will be called - there are no docs.
  12. Why should each source get modified to prepare data for the Attr2HBase sink?
    1. That's how the attr2hbase sink is designed: every attribute whose name carries the 2hb_ prefix is put into HBase, with the column name being the attribute name with the 2hb_ prefix stripped.
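A standalone sketch of the naming convention described in this item. The 2hb_ prefix is taken from the comment above; the class and method names here are illustrative, and Flume's actual Event/attr2hbase classes are not used.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration of the attr2hbase convention: attributes whose names start
// with the "2hb_" prefix are destined for HBase, keyed by the attribute
// name with the prefix stripped. Class/method names are stand-ins.
public class Attr2HBaseNaming {
  static final String PREFIX = "2hb_";

  /** Return only the HBase-bound attributes, keyed by column name. */
  static Map<String, byte[]> hbaseColumns(Map<String, byte[]> attrs) {
    Map<String, byte[]> cols = new HashMap<>();
    for (Map.Entry<String, byte[]> e : attrs.entrySet()) {
      if (e.getKey().startsWith(PREFIX)) {
        // strip "2hb_" to obtain the column name
        cols.put(e.getKey().substring(PREFIX.length()), e.getValue());
      }
    }
    return cols;
  }
}
```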
  13. What about the other fields in the EntryExtension?
    Should you set the date on the Event to be what the date was on the EntryExtension? (If so, which date?)
    1. Let the user decide what to use. This is just skeleton code.
      https://github.com/superfeedr has a community
  14. Should not call printStackTrace, should be logging.
  15. Should not call printStackTrace, should be logging.
  16. We need to get QueuedSourceBase in so that you can get rid of this method ... it has several issues:
     * It doesn't poll to allow a graceful close.
     * It isn't updating stats.
     * It should log the exception, not print the stack trace.
    But better would be to just use QueuedSourceBase after we get Jon to commit it.
    1. OK; as of now, using feedr.close and logging the error.
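A minimal sketch of the next() behavior QueuedSourceBase is meant to provide, per the three issues above: poll with a timeout instead of blocking forever (so a close can be observed), keep a received-event counter updated, and leave no room for printStackTrace. The field names are assumptions, and a String stands in for Flume's Event type.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a queued source whose next() polls rather than blocks,
// so close() takes effect promptly, and which updates stats per event.
public class PollingQueueSource {
  private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
  private final AtomicLong received = new AtomicLong();
  private volatile boolean open = true;

  public void enqueue(String event) { queue.offer(event); }

  public void close() { open = false; }

  /** Returns the next event, or null once the source has been closed. */
  public String next() throws InterruptedException {
    while (open) {
      // poll with a timeout instead of take(): allows a graceful close
      String e = queue.poll(100, TimeUnit.MILLISECONDS);
      if (e != null) {
        received.incrementAndGet(); // update stats
        return e;
      }
      // loop back and re-check 'open'
    }
    return null;
  }

  public long receivedCount() { return received.get(); }
}
```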
  17. Should probably do something here?
  18. Since you keep feedr around, if the source were re-opened (if you supported gracefully closing it), you'd want to null it out here probably? Is there anything else that has to be done to properly dispose of a feedr object?
  19. Log, not print stack trace again ...
Review request changed
  2. plugins/superfeedrsource/README (Diff revision 3)
    add a quick explanation of what superfeedr hub is.
    Probably don't need "you can choose your own sink"
  3. plugins/superfeedrsource/README (Diff revision 3)
    The steps below show how .. sink.  This is good for debugging and demonstration purposes.
    (It does serve a purpose)
  4. plugins/superfeedrsource/README (Diff revision 3)
    update link.
  5. plugins/superfeedrsource/README (Diff revision 3)
    ../flume/$ ant
  6. plugins/superfeedrsource/README (Diff revision 3)
    ../flume/plugins/superfeedrsource$ ant
  7. plugins/superfeedrsource/README (Diff revision 3)
  8. plugins/superfeedrsource/README (Diff revision 3)
    maybe just use this:
    $ bin/flume dump 'superfeedrSource(xxxxx)'
  9. is this superfeedr specific or will it work for all pubsubhubbub hubs? 
  10. Style -- exit from if with return and reduce the depth of else block.
    for (Iterator....
  11. actually, probably don't even need the first if -- it will fall straight through the loop.
    Java style:
    for (ItemExtension item : event.getItems().getItems()) {
      Event e = ...
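A sketch of the style items 10 and 11 suggest: return early instead of nesting an else block, and use Java's enhanced for loop rather than an explicit Iterator. The ItemExtension/Event types from the patch are replaced here with stand-ins so the example is self-contained.

```java
import java.util.Arrays;
import java.util.List;

// Early-return plus enhanced-for style; Strings stand in for the
// Superfeedr ItemExtension objects iterated over in the patch.
public class EntryStyle {
  /** Process items, returning how many were handled. */
  static int processItems(List<String> items) {
    if (items == null) {
      return 0; // exit early with return; no else block, less nesting
    }
    int processed = 0;
    for (String item : items) { // enhanced for: no Iterator boilerplate
      // ... build and deliver an Event from 'item' here ...
      processed++;
    }
    return processed;
  }
}
```

Note that, as item 11 points out, an emptiness check before the loop is unnecessary: an empty list simply falls straight through.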
  12. change to be:
    LOG.error(xxxx, e1)
    Remove e1.printStackTrace();
  13. add exception argument:
    LOG.error("xxx", e);
  14. plugins/superfeedrsource/src/java/com/cloudera/flume/superfeedr/SuperfeedrSource.java (Diff revision 3)
    To be completely correct, probably need a latch that awaits until countdown is called as a finalizer around line 145.
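A sketch of the latch pattern this item describes: close() blocks until the worker thread signals, from a finally block, that it has fully shut down. The class and field names are illustrative; the worker body is a placeholder for the Superfeedr callback loop in the patch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// close() awaits a latch that the worker counts down as it exits,
// so shutdown is not considered complete until the worker really is.
public class LatchedShutdown {
  private final CountDownLatch done = new CountDownLatch(1);
  private volatile boolean running = true;

  public void runWorker() {
    try {
      while (running) {
        // ... consume pushed entries here ...
        Thread.yield();
      }
    } finally {
      done.countDown(); // always signal, even if the loop threw
    }
  }

  /** Request shutdown and wait (bounded) until the worker has exited. */
  public boolean close() throws InterruptedException {
    running = false;
    return done.await(5, TimeUnit.SECONDS);
  }
}
```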
  15. usage: superfeedrSource(username, password, url1[, url2[,...]])
  16. LOG and throw IllegalArgumentException.
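A sketch of the validation items 15 and 16 ask for: at least a username, password, and one URL, else log and throw IllegalArgumentException with the usage string above. The class and method names are assumptions; java.util.logging stands in here for the SLF4J logger the patch would actually use.

```java
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;

// Argument check for the source builder: log the usage string and
// throw IllegalArgumentException when fewer than three args are given.
public class SuperfeedrArgs {
  static final String USAGE =
      "usage: superfeedrSource(username, password, url1[, url2[,...]])";
  private static final Logger LOG =
      Logger.getLogger(SuperfeedrArgs.class.getName());

  static List<String> parse(String... argv) {
    if (argv.length < 3) {
      LOG.severe(USAGE);                         // log ...
      throw new IllegalArgumentException(USAGE); // ... and throw
    }
    return Arrays.asList(argv); // username, password, then URLs
  }
}
```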