
I can't help but mention that you should probably be using Node.js with the jsdom module for a task like this these days. You get the complete power of jQuery with jsdom, which makes screen scraping child's play.


It may not use the MIGHTY POWER OF JAVASCRIPT, but BeautifulSoup is a best-of-breed real-world HTML parser: not just in its API, but in how thoroughly its parsing algorithm has been verified against HTML found on real web sites. I find it unlikely that jsdom is actually significantly better. Does that code really look like it's going to be significantly improved by jQuery?

    titles = [x for x in soup.findAll('td','title') if x.findChildren()][:-1]
packs a lot of punch.
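
To show what that one-liner is doing, here's a runnable sketch against a made-up fragment of HN-style markup (the `bs4` import is the modern one; the thread's era used `from BeautifulSoup import BeautifulSoup`, but `findAll` still works in both):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td class="title"><a href="http://a.example/">First story</a></td></tr>
  <tr><td class="title"><a href="http://b.example/">Second story</a></td></tr>
  <tr><td class="title"><a href="news?p=2">More</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# keep only title cells that actually contain child elements,
# then drop the last one (on HN that's the trailing "More" link)
titles = [td for td in soup.findAll("td", "title") if td.findChildren()][:-1]
print([td.a.get_text() for td in titles])  # ['First story', 'Second story']
```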


The ability to use CSS/jQuery selectors is really nice, though. To find all <td>s whose parent has the class "blah" in BeautifulSoup, you have to use a list comprehension:

  tds = [td for td in soup.findAll('td') if td.parent.get('class') == 'blah']
In jQuery, this is more compactly written:

  tds = $('.blah > td')
And if you just want to look for <td>s somewhere within a .blah element, you can use

  tds = $('.blah td')
This is a lot less clear in BeautifulSoup:

  tds = [td for td in soup.findAll('td') if td.findParents(attrs={'class': 'blah'})]

(If there are better ways to write this BeautifulSoup code, please let me know)
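
For what it's worth, later BeautifulSoup versions (bs4) grew a select() method that takes CSS selectors directly, which mostly answers this. A quick sketch with a made-up document:

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup(
    '<div class="blah"><table><tr><td>inside</td></tr></table></div>'
    '<table><tr><td>outside</td></tr></table>',
    "html.parser",
)

# descendant selector, same as jQuery's $('.blah td')
print([td.get_text() for td in doc.select(".blah td")])  # ['inside']
```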

Selectors have some other benefits too - you can just go to the CSS file and grab the selector that matches what you want, and you can be reasonably sure it'll work in most cases.


lxml in Python lets you use CSS selectors.


My impression was that BeautifulSoup is actually getting very long in the tooth - I didn't use it much, but I have used a lot of Scrapy, a beautiful framework that completely supplants BeautifulSoup for scraping.


You can get the best of all worlds, imo, by using lxml: it supports the selectors you want, it's Python (which I prefer), and in my experience it's more robust than BeautifulSoup.

I spent more than a year writing hundreds of scrapers that ran for weeks at a time. BeautifulSoup did not work out as well as lxml in practice. On extremely JavaScript-heavy pages we actually used PyV8.

edit: more information at http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciat... ; the comments are useful too.


Even better, use Scrapy, which is a whole framework designed specifically for scraping and is built on top of libxml2, like lxml.


Scrapy is overkill for nearly everything. You'll probably have under a page of code using lxml and urllib.
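
A sketch of that lxml-plus-urllib approach (URL and class names are assumed for illustration; the thread's era would have used urllib2, this uses Python 3's urllib.request, and the fetch/parse split just keeps the parsing independently usable):

```python
import urllib.request
from lxml import html

def parse_titles(page):
    """Pull the link text out of every <td class="title"> cell."""
    doc = html.fromstring(page)
    return [a.text_content() for a in doc.xpath('//td[@class="title"]//a')]

def scrape_titles(url):
    """Fetch a page and parse it in one go."""
    return parse_titles(urllib.request.urlopen(url).read())
```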


I have under a page of code with Scrapy for simple projects, and more advanced features when I need them.

That's like saying "jQuery is overkill for just about everything, you should use plain JavaScript".


No, it's like saying "The full YUI suite is overkill for just about everything, you should just use the core or jQuery".

'scrapy startproject' creates a couple of nested directories, with maybe seven files. Are you writing a scraper that you're going to run regularly? Does it need to be super robust and maintainable? Or are you writing something that you'll run once, maybe twice?


I seem to be missing why you think using a framework is a bad thing. With, say, Django or YUI, there are performance and abstraction issues that can bite you, but I don't see those mattering for so lightweight a framework and so tightly scoped a problem.


The combination of BeautifulSoup and mechanize also makes tasks incredibly simple.


JSoup (http://jsoup.org) gives Java similar capabilities :)



