Hacker News

Unfortunately someone would raise a privacy stink, just like they did for the Netflix Prize data and the AOL search data. This is why we can't have nice things.


But Delicious's database is already public (if you take out private fields on the user table and the private links). Even just the links + tags, without any user info, would be great for semantic web usage.


It's public right now. It won't be once Yahoo pulls the plug on Delicious.


    User-agent: *
    Disallow: /

I don't remember the robots.txt rules for sure, but doesn't that mean they don't allow crawlers at all?


That's the rule for crawlers that aren't Slurp, Googlebot, Teoma, or msnbot.
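That grouping can be checked with Python's standard-library robots.txt parser. A minimal sketch (the Slurp group below is a simplified stand-in for the real Delicious robots.txt): crawlers with their own rule group use it, and everyone else falls through to the catch-all `User-agent: *` block.

```python
from urllib.robotparser import RobotFileParser

# Simplified stand-in for the Delicious robots.txt discussed above:
# one named crawler is allowed everything, all others are disallowed.
robots_txt = """\
User-agent: Slurp
Disallow:

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Slurp", "http://delicious.com/popular"))      # → True
print(rp.can_fetch("MyScraper", "http://delicious.com/popular"))  # → False
```

So the site isn't closed to crawlers outright; it just whitelists the big four and disallows everyone else.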


I noticed the extra rules, but I am neither Slurp, Googlebot, Teoma nor msnbot :-(


robots.txt is merely a suggestion.


It's public data! You could scrape and index it now for free if you wanted...


They even throttle the FriendFeed scraper, which graciously pulls all its users' data at once.

You can't pull their data with a simple scraper; you'd need one distributed across hundreds of machines on the web.
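For a single machine, the best you can do is a polite scraper that paces itself under the throttle. A rough sketch (the base URL and `?page=` parameter are hypothetical, and the fixed delay is an assumed courtesy interval, not a known limit):

```python
import time
import urllib.request

def page_urls(base, pages):
    """Yield paginated URLs: base, base?page=2, base?page=3, ..."""
    for n in range(1, pages + 1):
        yield base if n == 1 else f"{base}?page={n}"

def scrape(base, pages, delay=2.0):
    """Fetch each page, sleeping between requests to stay under the throttle."""
    for url in page_urls(base, pages):
        html = urllib.request.urlopen(url).read()
        yield url, html
        time.sleep(delay)  # one request every `delay` seconds, single-threaded

# Hypothetical usage:
# for url, html in scrape("http://delicious.com/tag/python", pages=200):
#     ...
```

At one request every couple of seconds, a full crawl would take weeks, which is exactly why the comment above suggests distributing the work.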


I heard that there is this thing called "the cloud" where you can rent servers and pay only for the time you use. That makes cheap, disposable servers both realistic and quite simple ;)

Actually, I just noticed you get 750 hours of free micro instance time from AWS... I wonder if it would be worth doing. I imagine the links + tags are <100GB in total.
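A back-of-envelope check of that <100GB guess (both numbers below are assumptions, not figures from the thread):

```python
# Assumed totals: ~180M public bookmarks, ~500 bytes per record
# (URL + tags + title). Neither number comes from Delicious itself.
bookmarks = 180e6
bytes_per_record = 500

total_gb = bookmarks * bytes_per_record / 1e9
print(f"~{total_gb:.0f} GB")  # → ~90 GB, consistent with the <100GB estimate
```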


Though I've noticed only pages up to 200 work when going back through the history... this only gets you a few days back on the most popular tags.


Sure, but user pages go back farther than that...


"... It's public data! You could scrape and index it now for free ..."

That is the most insightful thing I've read today.



