domstripper - A lxml.html test project

Thursday, Nov 20, 2008
1 comment Python

URL: http://pypi.python.org/pypi/domstripper/

I'm just playing with the impressive lxml.html package. It makes it possible to easily work with HTML trees and manipulate them.

I had this crazy idea of a "DOM stripper" that removes all but the specified elements from an HTML file. For example, you might want to keep the contents of the <head> tag intact but keep only the <div id="content">...</div> element, thus omitting <div id="banner">...</div> and <div id="nav">...</div>. domstripper now does that. It could be used, for example, as a naive proxy that transforms a bloated HTML page into a smaller, stripped-down version suitable for, say, mobile web browsers. It's more a proof of concept than anything else.

To test it you just need a virtual Python environment and the system libraries needed to install lxml. This worked for me:


$ sudo apt-get install cython libxslt1-dev zlib1g-dev libxml2-dev
$ cd /tmp
$ virtualenv --no-site-packages testenv
$ cd testenv
$ source bin/activate
$ easy_install domstripper

Now you can use it like this:


>>> from domstripper import domstripper
>>> help(domstripper)
...
>>> domstripper('bloat.html', ['#content', 'h1.header'])
<!DOCTYPE...
...

Best to just play with it and see if it makes sense. I'm not saying this is an amazing package, but it goes to show what can be done with lxml.html and its extremely user-friendly CSS selectors.

Comments

Ian Bicking November 20, 2008

A couple things:

You also need (but maybe already had) libxml2-dev. You don't need cython if you are installing an lxml release (it comes with the built C files).

You can just parse with lxml.html.parse (it just uses that parser). Also documents have a .body attribute, you don't have to select it (if you use parse, you have to do parse(...).getroot()). You should create new elements using lxml.html.Element, as it will be an HTML element instead of a generic lxml.etree element.

I think there's a way to access the doctype too, but I can't remember (it's probably in .docinfo somewhere) -- you don't have to parse it out yourself I think.

Performance probably doesn't matter here, but precompiling CSSSelectors would be faster (it's a several-stage process to compile them, but once compiled they are very fast). If you don't compile, you can just use .cssselect(selector) instead of actually instantiating one.

Anyway, a few things to consider
