
You can write an entire web scraper with just a URL using http://scrape.ly

With scrape.ly I can just do this to crawl the entire Hacker News site across pages, grab the URLs, and extract data from each page it lands on without defining any fields. It discovers them on its own, so you don't have to 'relabel' fields when the site changes layout. It also generates new IP addresses on the fly so you don't get stuck, and launches multiple threads for you to speed up the process. It works fully with AJAX sites and single-page apps. Flash support is coming too.

    http://scrape.ly/s/{https://news.ycombinator.com/}
    {next:More}{Space Monkey dumps Python for Go}*{fields:'Auto'}
Honest question (I don't mind downvotes if you disagree): why would you want to waste time writing web scrapers, keeping them running, and fixing the code? Multiply that by 100 or 1000 different websites and it becomes a full-time job. I want to get the data I need with the least possible overhead and as soon as possible, and I don't really want to be bothered with setting up environments and hosting for it to run, or with fixing bugs when sites change layout.
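
For comparison, the hand-written equivalent of the one-liner above (collect the story links on a page, then follow the "More" link) might look roughly like this in Python. This is only a minimal sketch using the standard library: the names `LinkCollector` and `parse_page`, the sample markup, and the assumption that the pagination link's text is exactly "More" are all illustrative, not taken from scrape.ly or from Hacker News's real HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (text, absolute_url) pairs for every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def parse_page(html, base_url, next_text="More"):
    """Return (story_links, next_page_url_or_None) for one page of HTML."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    stories, next_url = [], None
    for text, url in collector.links:
        if text == next_text:
            next_url = url          # pagination link (assumed label)
        else:
            stories.append((text, url))
    return stories, next_url


# Hypothetical sample markup, not the real Hacker News page structure.
SAMPLE = """
<a href="https://example.com/post">Space Monkey dumps Python for Go</a>
<a href="news?p=2">More</a>
"""

stories, next_url = parse_page(SAMPLE, "https://news.ycombinator.com/")
```

A real crawler would still need to fetch each `next_url` in a loop until none is found, and handle rate limits and layout changes, which is the overhead I'm talking about.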


This is not a post about web scraping. It's a post about doing something in Factor.


    Web scraping with Factor


Are you familiar with the idea of implementing common problems for the sake of pedagogy? For example, someone who might want to demonstrate how a particular programming language can be used might start a blog, and in that blog said person might post articles demonstrating how you could attack a particular problem in that language.

Your criticism of this post comes across as tone-deaf. You might as well have written the editors of Beautiful Code to lecture them about how the chapter on quicksort is horribly misguided and that everything a good software craftsman should ever care to know on the subject can be found at http://docs.oracle.com/javase/7/docs/api/java/util/Arrays.ht...


Honestly, I meant no harm. I saw that we were talking about web scraping in other languages like PHP and Python, and I wanted to add that Factor doesn't really provide any more value than an implementation of the job in another language would. They all share the same web scraping overhead, which falls on the developer's shoulders. All in all, I wanted to point out that one shouldn't put so much effort into creating web scrapers, and to suggest a different tool that is specialized for the job described in the article.


Or if you're a Python enthusiast, then shameless self-link: http://jakeaustwick.me/python-web-scraping-resource/


Pretty good, but that's an awful lot of reading and a lot of work just to grab some data from a simple website.




