Hi sandGorgon! Thanks for the encouraging words. We haven't yet seen people use the existing Scalding ETL to pull data from Twitter or Facebook. As you suggest, there are some considerations around using Hadoop to access web APIs without getting throttled or banned. Here are a couple of links which might be helpful:
I think a gatekeeper service could make sense. Alternatively, you could write something which runs prior to your MapReduce job and just loads the API results into HDFS/HBase, so that the MapReduce job can then look them up there rather than calling the API itself. Akka or maybe Storm could be choices here.
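To make the gatekeeper idea a bit more concrete, here is a minimal sketch of a token-bucket rate limiter that a pre-load step could use to stay under an API's rate limit. It is only an illustration: `TokenBucket`, `fetch_all` and the `fetch` callback are hypothetical names, the rate and capacity are placeholders rather than real Twitter or Facebook limits, and writing the results to HDFS/HBase is left out.

```python
import time


class TokenBucket:
    """Allow roughly `rate` calls per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; return False if the caller must wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def fetch_all(user_ids, fetch, bucket, sleep=time.sleep):
    """Call `fetch` (a hypothetical API client function) once per id,
    waiting whenever the bucket is empty."""
    results = {}
    for uid in user_ids:
        while not bucket.try_acquire():
            sleep(1.0 / bucket.rate)    # wait about one token's worth of time
        results[uid] = fetch(uid)
    return results
```

Because all API calls go through a single bucket, the request rate stays bounded no matter how many mappers the later MapReduce job runs; the job only reads the pre-loaded results.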
We have done a small prototype project to pull data out of Twitter & Facebook - that was only a Python/RDS pilot, but it gave us some ideas for how a proper social feed into Snowplow could work.
Have you seen your ETL used to pull data from Twitter or Facebook? I am wondering what the state of the art is there, considering throttling, etc.