
Thanks for this. Snowplow has been an amazing source of learning. I'm more interested in the ETL process than in the actual MapReduce.

Have you seen your ETL used to pull data from Twitter or Facebook? I'm wondering what the state of the art is there, considering throttling, etc.



Hi sandGorgon! Thanks for the encouraging words. We haven't yet seen people use the existing Scalding ETL to pull data from Twitter or Facebook. As you suggest, there are some considerations around using Hadoop to access web APIs without getting throttled or banned. Here are a couple of links which might be helpful:

- http://stackoverflow.com/questions/6206105/running-web-fetch...

- http://petewarden.com/2011/05/02/using-hadoop-with-external-...

I think a gatekeeper service could make sense; alternatively, you could write something which runs prior to your MapReduce job and e.g. just loads your results into HDFS/HBase, for the MapReduce to then look up into. Akka or maybe Storm could be choices here.
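To make the "run something before the MapReduce job" idea concrete, here's a minimal sketch in Python of a throttled prefetcher: it rate-limits calls to a social API and writes newline-delimited JSON that you could then bulk-load into HDFS/HBase. Everything here is hypothetical — `fetch_fn` stands in for whatever Twitter/Facebook client you'd actually use, and the rate limit would come from the API's published quotas:

```python
import json
import time


class Throttle:
    """Naive rate limiter: enforce a minimum interval between calls."""

    def __init__(self, rate, per=1.0):
        # Allow at most `rate` calls per `per` seconds.
        self.min_interval = per / rate
        self.last = 0.0

    def wait(self):
        delay = self.min_interval - (time.monotonic() - self.last)
        if delay > 0:
            time.sleep(delay)
        self.last = time.monotonic()


def prefetch(ids, fetch_fn, out_path, rate=1.0):
    """Fetch a record per id via fetch_fn (a stand-in for a real
    Twitter/Facebook API call), throttled to `rate` requests/second,
    and write newline-delimited JSON for a later load into HDFS/HBase."""
    throttle = Throttle(rate)
    with open(out_path, "w") as out:
        for item_id in ids:
            throttle.wait()
            record = fetch_fn(item_id)
            out.write(json.dumps(record) + "\n")
```

You'd run this as a standalone step (cron job, Akka actor, etc.), then `hadoop fs -put` the resulting file so the MapReduce job never touches the external API directly.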

We have done a small prototype project to pull data out of Twitter & Facebook - that was only a Python/RDS pilot, but it gave us some ideas for how a proper social feed into Snowplow could work.


Is your Python code part of the Snowplow repository, or available as a gist?

It would be very interesting to take a look at it.



