
Don't write your own crawler. Use Nutch.

It is designed to scale and to do MapReduce-style parallel processing. I would strongly recommend taking a look before writing your own.

http://lucene.apache.org/nutch/
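
The reason a MapReduce-style fetcher can still be polite is that the fetch list gets partitioned by host: each worker drains one host's queue with a delay between requests, while different hosts are fetched in parallel. Below is only a rough Python sketch of that idea, not Nutch's actual code; the queue layout and the delay value are assumptions.

    import time
    from collections import defaultdict
    from concurrent.futures import ThreadPoolExecutor
    from urllib.parse import urlsplit
    from urllib.request import urlopen

    PER_HOST_DELAY = 2.0  # assumed politeness delay between hits to the same host

    def partition_by_host(urls):
        """Group URLs by hostname, mirroring how a MapReduce-style fetcher
        partitions its fetch list so one worker owns one host."""
        queues = defaultdict(list)
        for url in urls:
            queues[urlsplit(url).hostname].append(url)
        return queues

    def fetch_host_queue(host, urls):
        """Fetch one host's URLs sequentially with a delay, so parallelism
        across hosts never turns into a burst against a single site."""
        pages = []
        for url in urls:
            try:
                with urlopen(url, timeout=10) as resp:
                    pages.append((url, resp.read()))
            except OSError as exc:
                pages.append((url, exc))
            time.sleep(PER_HOST_DELAY)
        return host, pages

    def crawl(urls, max_workers=8):
        queues = partition_by_host(urls)
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(fetch_host_queue, h, q) for h, q in queues.items()]
            return dict(f.result() for f in futures)

Parallelism here scales with the number of distinct hosts, not with the number of URLs, which is the point: no single third-party site sees more than one request every PER_HOST_DELAY seconds.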



MapReduce? Just how many requests will you be making to third-party sites at once? Sounds like a good way to get blocked fast.
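
One way to avoid that, however many workers you run, is to check robots.txt and honour its crawl-delay before each request. A minimal sketch using Python's standard urllib.robotparser; the user agent string and the 1-second fallback delay are assumptions.

    import time
    from urllib.parse import urlsplit, urlunsplit
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "my-crawler"  # hypothetical user agent string
    last_hit = {}              # hostname -> timestamp of the last request

    def allowed_and_throttled(url):
        """Return True once the URL is allowed by robots.txt and enough
        time has passed since the last request to the same host."""
        parts = urlsplit(url)
        robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
        rp = RobotFileParser()
        rp.set_url(robots_url)
        rp.read()
        if not rp.can_fetch(USER_AGENT, url):
            return False
        delay = rp.crawl_delay(USER_AGENT) or 1.0  # assumed 1 s fallback
        wait = last_hit.get(parts.hostname, 0) + delay - time.time()
        if wait > 0:
            time.sleep(wait)
        last_hit[parts.hostname] = time.time()
        return True

A real crawler would cache the parsed robots.txt per host instead of re-fetching it on every call, but the throttling logic stays the same.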



