pip install pyquery
For parsing HTML in Python, Beautiful Soup is often recommended, and it does a great job. It sports a good Pythonic API, and introductory guides are easy to find on the web. All is good in parsing-land ... until you want to parse more than a dozen documents at a time and immediately run head-first into performance problems. It is, simply put, very, very slow.
Just how slow? Check out this chart from the excellent Python HTML Parser comparison Ian Bicking compiled in 2008:
What immediately stands out is how fast lxml is. Compared to Beautiful Soup, though, the lxml docs are pretty sparse, and that's what originally kept me from adopting this mustang of a parsing library. lxml is also pretty clunky to use. Yeah, you can learn XPath or cssselect to select specific elements out of the tree, and it becomes kind of tolerable. But once you've selected the elements you actually want, you have to navigate the labyrinth of attributes lxml exposes: some contain the bits you want to get at, but the vast majority just return None. This becomes easier after a couple dozen uses, but it remains unintuitive.
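To make that complaint concrete, here is a minimal sketch of pulling text out of a paragraph with plain lxml. The markup and variable names are made up for illustration; the point is only that an element's text is split across `.text` and `.tail`, and either can be None.

```python
import lxml.html

# Hypothetical sample markup, invented for this illustration.
SAMPLE_HTML = '<div id="container"><p>Hello <b>bold</b> world</p></div>'

tree = lxml.html.fromstring(SAMPLE_HTML)
paragraph = tree.xpath('//div[@id="container"]/p')[0]
bold = paragraph[0]  # first child element of the paragraph

# .text only holds the text before the first child element.
leading_text = paragraph.text    # 'Hello '
# .tail holds text after the element's closing tag; there is none here.
paragraph_tail = paragraph.tail  # None
# The rest of the sentence hides in the child's .tail.
bold_tail = bold.tail            # ' world'
# text_content() stitches everything back together.
full_text = paragraph.text_content()
```

Getting at "Hello bold world" means knowing about `text_content()` up front; poking at the individual attributes gets you fragments and None.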
So either slow and easy to use or fast and hard to use, right?
Oh PyQuery you beautiful seductress:
from pyquery import PyQuery
page = PyQuery(some_html)
last_red_anchor = page('#container > a.red:last')
Easy as pie. It's the ever-beloved jQuery, but in Python!
There are some gotchas. For example, PyQuery, like jQuery, exposes its internals upon iteration: looping over a selection yields the raw lxml elements, forcing you to re-wrap each one:
for paragraph in page('#container > p'):
    paragraph = PyQuery(paragraph)
    text = paragraph.text()
That's a wart the PyQuery creators ported over from jQuery (where they'd fix it if it didn't break compatibility). Understandable, but still unfortunate for such a great library.