Comment

Ian Bicking

A couple things:

You also need libxml2-dev (though maybe you already had it). You don't need Cython if you are installing an lxml release, since releases come with the generated C files.

You can just parse with lxml.html.parse (it just uses that parser). Documents also have a .body attribute, so you don't have to select it (if you use parse, you have to do parse(...).getroot(), because it returns a tree rather than the root element). You should create new elements using lxml.html.Element, as it will be an HTML element instead of a generic lxml.etree element.
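Roughly, those three points look like this. This is only a sketch: the sample markup and the variable names are made up for illustration, and a real script would parse a file or URL instead of an in-memory buffer.

```python
import io

import lxml.html

# Hypothetical sample page; any HTML source would do.
SAMPLE = b"<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# parse() returns an ElementTree, so getroot() is needed to reach <html>.
tree = lxml.html.parse(io.BytesIO(SAMPLE))
root = tree.getroot()

# .body is available directly on the document; no need to select it.
body = root.body

# lxml.html.Element creates an HTML element, not a generic etree element.
note = lxml.html.Element("div")
note.text = "Added by the script"
body.append(note)

result = lxml.html.tostring(root, encoding="unicode")
```

Because the new element is an HTML element, it has the same HTML-specific helpers (.cssselect, .text_content, and so on) as the elements that came out of the parser.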

I think there's a way to access the doctype too, but I can't remember (it's probably in .docinfo somewhere) -- you don't have to parse it out yourself I think.
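For what it's worth, a tree returned by parse() does expose a .docinfo object, and it carries the doctype. A small sketch, assuming a made-up sample document with an HTML 4.01 doctype:

```python
import io

import lxml.html

# Hypothetical sample document with an explicit doctype.
SAMPLE = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    b'"http://www.w3.org/TR/html4/strict.dtd">\n'
    b"<html><body><p>Hi</p></body></html>"
)

tree = lxml.html.parse(io.BytesIO(SAMPLE))
info = tree.docinfo

doctype = info.doctype        # the full doctype declaration as a string
public_id = info.public_id    # just the public identifier
system_url = info.system_url  # just the DTD URL
```

Note that .docinfo lives on the tree, not on the root element, so it has to be read before (or alongside) calling getroot().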

Performance probably doesn't matter here, but precompiling CSSSelectors would be faster (it's a several-stage process to compile them, but once compiled they are very fast). If you don't compile, you can just use .cssselect(selector) instead of actually instantiating one.

Anyway, a few things to consider.