I can't help but mention that you should probably be using node.js with the jsdom module for such a task these days. You get the full power of jQuery with jsdom, which makes screen scraping child's play.
It may not use the MIGHTY POWER OF JAVASCRIPT, but BeautifulSoup is a best-of-breed parser for real-world HTML: not just in its API, but in how thoroughly its parsing algorithm has been verified against HTML found on real web sites. I find it unlikely that jsdom is actually significantly better. Does that code really look like it would be significantly improved by jQuery?
titles = [x for x in soup.findAll('td','title') if x.findChildren()][:-1]
The ability to use CSS/jQuery selectors is really nice, though. To find all <td>s whose parent has the class "blah" in BeautifulSoup, you have to use a list comprehension:
tds = [td for td in soup.findAll('td') if td.parent.get('class') == 'blah']
In jQuery, this is more compactly written:
tds = $('.blah > td')
And if you just want to look for <td>s somewhere within a .blah element, you can use
tds = $('.blah td')
This is a lot less clear in BeautifulSoup:
tds = [td for td in soup.findAll('td') if td.findParents(attrs={'class': 'blah'})]
(If there are better ways to write this BeautifulSoup code, please let me know)
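One better way, for what it's worth: newer BeautifulSoup (the bs4 package) supports CSS selectors directly via select(), so the jQuery-style queries above carry over almost verbatim. A small sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

# illustrative markup, not from any real site
html = """
<table>
  <tr class="blah"><td>one</td><td>two</td></tr>
  <tr><td>three</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# descendant form, like $('.blah td')
tds = soup.select('.blah td')
print([td.get_text() for td in tds])   # ['one', 'two']

# direct-child form, like $('.blah > td')
child_tds = soup.select('.blah > td')
```

That covers both jQuery examples above without any list comprehension gymnastics.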
Selectors have some other benefits too - you can just go to the CSS file and grab the selector that matches what you want, and you can be reasonably sure it'll work in most cases.
My impression is that BeautifulSoup is actually getting very long in the tooth - I haven't used it much, but I have used a lot of Scrapy, a beautiful framework that completely supplants BeautifulSoup for scraping.
You can get the best of all worlds, imo, by using lxml: it supports the selectors you want, it's Python (which I prefer), and in my experience it's more robust than BeautifulSoup.
I spent more than a year writing hundreds of scrapers that ran for weeks at a time, and BeautifulSoup did not work out as well as lxml in practice. On extremely JavaScript-heavy pages we actually used PyV8.
No, it's like saying "The full YUI suite is overkill for just about everything, you should just use the core or jQuery".
'scrapy startproject' creates a couple of nested directories containing maybe seven files. Are you writing a scraper that you're going to run regularly? Does it need to be super robust and maintainable? Or are you writing something that you'll run once, maybe twice?
I seem to be missing why you think using a framework is a bad thing. With, say, Django or YUI there are performance and abstraction issues that can bite you, but I don't see those mattering for so lightweight a framework and so tightly scoped a problem.