
I can't help but mention that you should probably be using Node.js with the jsdom module for a task like this these days. You get the complete power of jQuery with jsdom, which makes screen scraping child's play.


It may not use the MIGHTY POWER OF JAVASCRIPT, but BeautifulSoup is a best-of-breed real-world HTML parser: not just in its API, but in how thoroughly its parsing algorithm has been verified against HTML found on real web sites. I find it unlikely that jsdom is actually significantly better. Does that code really look like it's going to be significantly improved by jQuery?

    titles = [x for x in soup.findAll('td','title') if x.findChildren()][:-1]
packs a lot of punch.
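
To show what that one-liner is doing, here's a runnable sketch against a made-up fragment of HN-style markup (the `bs4` import is the modern one; the thread's era used `from BeautifulSoup import BeautifulSoup`, but `findAll` still works in both):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td class="title"><a href="http://a.example/">First story</a></td></tr>
  <tr><td class="title"><a href="http://b.example/">Second story</a></td></tr>
  <tr><td class="title"><a href="news?p=2">More</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# keep only title cells that actually contain child elements,
# then drop the last one (on HN that's the trailing "More" link)
titles = [td for td in soup.findAll("td", "title") if td.findChildren()][:-1]
print([td.a.get_text() for td in titles])  # ['First story', 'Second story']
```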


The ability to use CSS/jQuery selectors is really nice, though. To find all <td>s whose parent has the class "blah" in BeautifulSoup, you have to use a list comprehension:

  tds = [td for td in soup.findAll('td') if td.parent.get('class') == 'blah']
In jQuery, this is more compactly written:

  tds = $('.blah > td')
And if you just want to look for <td>s somewhere within a .blah element, you can use

  tds = $('.blah td')
This is a lot less clear in BeautifulSoup:

  tds = [td for td in soup.findAll('td') if td.findParents(attrs={'class': 'blah'})]

(If there are better ways to write this BeautifulSoup code, please let me know)
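
For what it's worth, later BeautifulSoup versions (bs4) grew a select() method that takes CSS selectors directly, which mostly answers this. A quick sketch with a made-up document:

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup(
    '<div class="blah"><table><tr><td>inside</td></tr></table></div>'
    '<table><tr><td>outside</td></tr></table>',
    "html.parser",
)

# descendant selector, same as jQuery's $('.blah td')
print([td.get_text() for td in doc.select(".blah td")])  # ['inside']
```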

Selectors have some other benefits too - you can just go to the CSS file and grab the selector that matches what you want, and you can be reasonably sure it'll work in most cases.


lxml in Python lets you use CSS selectors.


My impression was that BeautifulSoup is actually getting very long in the tooth - I didn't use it much, but I have used a lot of Scrapy, a beautiful framework that completely supplants BeautifulSoup for scraping.


You can get the best of all worlds, imo, by using lxml: it supports the selectors you want, it's Python (which I prefer), and in my experience it's more robust than BeautifulSoup.

I spent more than a year writing hundreds of scrapers that ran for weeks at a time. BeautifulSoup did not work out as well as lxml in practice. On extremely JavaScript-heavy pages we actually used PyV8.

edit: more information at http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciat... ; the comments are useful too.


Even better, use Scrapy, which is a whole framework designed specifically for scraping and is built on top of libxml2, like lxml.


Scrapy is overkill for nearly everything. You'll probably have under a page of code using lxml and urllib.
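
A sketch of that lxml-plus-urllib approach (URL and class names are assumed for illustration; the thread's era would have used urllib2, this uses Python 3's urllib.request, and the fetch/parse split just keeps the parsing independently usable):

```python
import urllib.request
from lxml import html

def parse_titles(page):
    """Pull the link text out of every <td class="title"> cell."""
    doc = html.fromstring(page)
    return [a.text_content() for a in doc.xpath('//td[@class="title"]//a')]

def scrape_titles(url):
    """Fetch a page and parse it in one go."""
    return parse_titles(urllib.request.urlopen(url).read())
```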


I have under a page of code with Scrapy for simple projects, and more advanced features when I need them.

That's like saying "jQuery is overkill for just about everything, you should use plain JavaScript".


No, it's like saying "The full YUI suite is overkill for just about everything, you should just use the core or jQuery".

'scrapy startproject' creates a couple of nested directories, with maybe seven files. Are you writing a scraper that you're going to run regularly? Does it need to be super robust and maintainable? Or are you writing something that you'll run once, maybe twice?


I seem to be missing why you think using a framework is a bad thing. With, say, Django or YUI, there are performance and abstraction issues that can bite you, but I don't see those mattering for so lightweight a framework and so tightly scoped a problem.


The combination of BeautifulSoup and mechanize also makes tasks incredibly simple.


JSoup (http://jsoup.org) gives Java similar capabilities :)



