pip install pyquery
For parsing HTML in Python, Beautiful Soup
is oft recommended, and it does a great job. It sports a good Pythonic
API and it's easy to find introductory guides on the web. All is good in
parsing-land... until you want to parse more than a dozen documents at a
time and immediately run head-first into performance problems. It is,
simply put, very, very slow.
Just how slow? Check out this chart from the excellent Python HTML Parser comparison Ian Bicking compiled in 2008:
What immediately stands out is how fast lxml is. Compared to
Beautiful Soup, the lxml docs are pretty sparse and that's what
originally kept me from adopting this mustang of a parsing library. lxml
is also pretty clunky to use. Yeah, you can learn and use XPath or cssselect
to select specific elements out of the tree and it becomes kind of
tolerable. But once you've selected the elements that you actually want
to get, you have to navigate the labyrinth of attributes lxml exposes,
some containing the bits you want to get at, but the vast majority just
returning None. This becomes easier after a couple dozen uses but it remains unintuitive.
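To make the attribute labyrinth concrete, here is a minimal sketch using plain lxml with XPath. The markup and variable names are made up for illustration; only `lxml.html.fromstring`, `.xpath()`, `.text`, `.tail`, `.text_content()` and `.get()` come from lxml itself.

```python
import lxml.html

# Hypothetical sample markup, invented for this illustration.
SAMPLE_HTML = (
    '<div id="container">'
    '<p>Hello <b>bold</b> world</p>'
    '<a class="red" href="/first">first</a>'
    '</div>'
)

root = lxml.html.fromstring(SAMPLE_HTML)

paragraph = root.xpath('.//p')[0]
bold = paragraph[0]  # child elements are reached by indexing
anchor = root.xpath('.//a[@class="red"]')[0]

# .text holds only the text before the first child element.
print(repr(paragraph.text))
# .tail holds the text after the closing tag; there is none here, so None.
print(repr(paragraph.tail))
# The trailing " world" is stored on the <b> element's tail, not on <p>.
print(repr(bold.tail))
# text_content() is the call that returns the full text of the subtree.
print(paragraph.text_content())
print(anchor.get('href'))
```

The text of a single paragraph ends up spread over `.text` on one element and `.tail` on another, which is the kind of thing that takes a few rounds to internalize.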
So either slow and easy to use or fast and hard to use, right?
Oh PyQuery, you beautiful seductress:
from pyquery import PyQuery
page = PyQuery(some_html)
last_red_anchor = page('#container > a.red:last')
Easy as pie. It's the ever-beloved jQuery, but in Python!
There are some gotchas. For example, PyQuery, like jQuery, exposes its
internals upon iteration: looping over a selection yields raw lxml
elements rather than PyQuery objects, forcing you to re-wrap each one:
for paragraph in page('#container > p'):
    # Each item is a bare lxml element, so wrap it to get the PyQuery API back.
    paragraph = PyQuery(paragraph)
    text = paragraph.text()
That's a wart the PyQuery creators ported over from jQuery (where
they'd fix it if it didn't break compatibility). Understandable but
still unfortunate for such a great library.