
Thanks for this. Snowplow has been an amazing source of learning. I'm more interested in the ETL process than in the actual MapReduce.

Have you seen your ETL used to pull data from Twitter or Facebook? I'm wondering what the state of the art is there, considering throttling, etc.



Hi sandGorgon! Thanks for the encouraging words. We haven't yet seen people use the existing Scalding ETL to pull data from Twitter or Facebook. As you suggest, there are some considerations around using Hadoop to access web APIs without getting throttled or banned. Here are a couple of links which might be helpful:

- http://stackoverflow.com/questions/6206105/running-web-fetch...

- http://petewarden.com/2011/05/02/using-hadoop-with-external-...

I think a gatekeeper service could make sense; alternatively, you could write something which runs prior to your MapReduce job and e.g. just loads your results into HDFS/HBase, for the MapReduce to then look up into. Akka or maybe Storm could be choices here.
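To make the "run something before the MapReduce job" idea concrete, here's a minimal sketch in Python of a throttled prefetcher: it rate-limits calls to a social API and writes newline-delimited JSON that you could then bulk-load into HDFS/HBase. Everything here is hypothetical — `fetch_fn` stands in for whatever Twitter/Facebook client you'd actually use, and the rate limit would come from the API's published quotas:

```python
import json
import time


class Throttle:
    """Naive rate limiter: enforce a minimum interval between calls."""

    def __init__(self, rate, per=1.0):
        # Allow at most `rate` calls per `per` seconds.
        self.min_interval = per / rate
        self.last = 0.0

    def wait(self):
        delay = self.min_interval - (time.monotonic() - self.last)
        if delay > 0:
            time.sleep(delay)
        self.last = time.monotonic()


def prefetch(ids, fetch_fn, out_path, rate=1.0):
    """Fetch a record per id via fetch_fn (a stand-in for a real
    Twitter/Facebook API call), throttled to `rate` requests/second,
    and write newline-delimited JSON for a later load into HDFS/HBase."""
    throttle = Throttle(rate)
    with open(out_path, "w") as out:
        for item_id in ids:
            throttle.wait()
            record = fetch_fn(item_id)
            out.write(json.dumps(record) + "\n")
```

You'd run this as a standalone step (cron job, Akka actor, etc.), then `hadoop fs -put` the resulting file so the MapReduce job never touches the external API directly.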

We have done a small prototype project to pull data out of Twitter & Facebook - that was only a Python/RDS pilot, but it gave us some ideas for how a proper social feed into Snowplow could work.


Is your Python code part of the Snowplow repository, or available as a gist?

It would be very interesting to take a look at it.



