
I would suggest that you consider a couple of things:

1) How can you break down the work tasks your site/crawler is doing so that they can be divvied out across N processes? In the case of the crawler, how would you handle breaking up your work queue so that multiple spiders could coordinate their efforts without all crawling the same pages and without skipping any? How do you recombine the results when they're finished? What happens if a spider grabs a URL to crawl and dies before it can report back its results - will the URL still get crawled, or has it been lost?
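One common answer to the "spider dies mid-crawl" question is a lease-based work queue: a spider checks out a URL under a lease, and if it doesn't acknowledge completion before the lease expires, the URL goes back into the pool. Here's a minimal single-process sketch of that idea (the class name, lease mechanics, and in-memory storage are all illustrative assumptions, not any particular system's API):

```python
import time

class LeasedQueue:
    """Illustrative sketch of a leased work queue: a spider checks out
    a URL; if it dies before calling ack(), the lease expires and the
    URL is handed out again, so no work is silently lost."""

    def __init__(self, urls, lease_seconds=30):
        self.pending = list(urls)   # URLs not yet handed out
        self.leased = {}            # url -> lease expiry timestamp
        self.done = set()
        self.lease_seconds = lease_seconds

    def checkout(self):
        now = time.time()
        # Reclaim URLs whose spider never reported back.
        for url, expiry in list(self.leased.items()):
            if expiry <= now:
                del self.leased[url]
                self.pending.append(url)
        if not self.pending:
            return None
        url = self.pending.pop()
        self.leased[url] = now + self.lease_seconds
        return url

    def ack(self, url):
        # Spider finished the URL: record it and drop the lease.
        self.leased.pop(url, None)
        self.done.add(url)
```

A real deployment would put this state somewhere durable and shared (a database, or a queue service with visibility timeouts), but the invariant is the same: every URL is either pending, leased with an expiry, or done, so a crashed spider can never lose work permanently.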

2) Google takes the same sort of view of mass concurrency, applied to hardware. They considered questions like: What happens to the running processes if this disk drive goes offline? How can work being processed on multiple computers be recombined? What happens when a critical computing node dies while you're waiting on it to report its results?
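The scatter/gather-with-retries pattern behind those questions can be sketched in a few lines: hand tasks to workers, collect results as they finish, and resubmit any task whose worker "dies" (here simulated by an exception). This is an illustrative toy using Python threads, not how Google actually implements it; `run_with_retries`, the worker signature, and the attempt limit are all assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_with_retries(tasks, worker, max_attempts=3):
    """Scatter tasks across a worker pool, gather results, and
    resubmit any task whose worker raises, up to max_attempts.
    A task that keeps failing is recorded as None rather than
    silently dropped."""
    results = {}
    attempts = {t: 0 for t in tasks}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(worker, t): t for t in tasks}
        while futures:
            # Drain the current batch; retries join the next batch.
            for fut in as_completed(list(futures)):
                task = futures.pop(fut)
                try:
                    results[task] = fut.result()
                except Exception:
                    attempts[task] += 1
                    if attempts[task] < max_attempts:
                        futures[pool.submit(worker, task)] = task
                    else:
                        results[task] = None  # give up, record the loss
    return results
```

The design point is that the coordinator owns the task list, so a dead worker costs you one retry rather than one lost result.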

3) As it pertains to databases and other resources: how can the data needed to power the site be distributed across my computing resources as evenly as possible? How do I go about identifying and correcting bottlenecks?
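One standard technique for the even-distribution question is consistent hashing: keys are spread across nodes via a hash ring, so load stays roughly even and adding or removing a node only remaps a small fraction of the data. A minimal sketch, with the class name, replica count, and MD5 choice all being assumptions for illustration:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Illustrative consistent-hash ring: each node appears at many
    points ('replicas') on the ring so keys spread roughly evenly,
    and each key maps to the first node clockwise from its hash."""

    def __init__(self, nodes, replicas=100):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(replicas):
                h = self._hash(f"{node}:{i}")
                bisect.insort(self.ring, (h, node))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        # First ring point at or after the key's hash, wrapping around.
        i = bisect.bisect(self.ring, (h, ""))
        if i == len(self.ring):
            i = 0
        return self.ring[i][1]
```

For the bottleneck half of the question there's no code shortcut: it comes down to measuring per-node request and data volume (e.g. counting how many keys land on each node) and rebalancing when one node runs hot.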

It does not seem like you need Hadoop either to think through these problems or to work on coding solutions that take this sort of thing into account. Though if you build a system capable of mass concurrency, I can understand the desire to test it out in a widely distributed environment.


