10 August 2007 | Python
From the doc string:
A very spartan attempt at a script that converts HTML to plaintext. I originally wrote it because when I send out HTML emails I also want to send a plaintext version of the same email as multipart. Instead of having two methods for generating the text, I decided to focus on the HTML part first and foremost (considering that a large majority of people don't have a problem with HTML emails) and have the plaintext fallback created on the fly.
This little script takes a chunk of HTML and strips out everything except the <body> (or the element with a given ID), and inside that chunk it makes certain conversions, such as replacing all hyperlinks with footnotes where the URL is shown at the bottom of the text instead. <strong>words</strong> are converted to *words*, and it makes a fair attempt at getting the linebreaks right. As a last resort, it strips away all remaining tags that couldn't be gracefully replaced with a plaintext equivalent.
Thanks to Fredrik Lundh's unescape() function, things like 'Terms &amp; Conditions' are converted to 'Terms & Conditions'.
It's far from perfect but a good start. It works for me for now.
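To make the conversions described in the doc string a bit more concrete, here is a minimal sketch of the same idea. It is not the code from html2plaintext.py: it uses plain regular expressions instead of BeautifulSoup, and the standard library's html.unescape() as a stand-in for Fredrik Lundh's unescape() function. The function name html_to_plaintext and the footnote format "[1]" are my own assumptions for illustration.

```python
import re
from html import unescape


def html_to_plaintext(html):
    """Rough HTML-to-text conversion (illustrative sketch, not the real script)."""
    footnotes = []

    def link_to_footnote(match):
        # Remember the URL and leave a numbered marker after the link text.
        footnotes.append(unescape(match.group(1)))
        return "%s [%d]" % (match.group(2), len(footnotes))

    # Hyperlinks become "text [n]"; only double-quoted href values are handled.
    text = re.sub(r'(?is)<a\b[^>]*?\bhref="([^"]*)"[^>]*>(.*?)</a>',
                  link_to_footnote, html)
    # <strong>words</strong> (and <b>) become *words*.
    text = re.sub(r'(?is)<(strong|b)\b[^>]*>(.*?)</\1>', r'*\2*', text)
    # A fair attempt at linebreaks: <br> is one newline, block ends are two.
    text = re.sub(r'(?i)<br\s*/?>', '\n', text)
    text = re.sub(r'(?i)</(p|div|h[1-6]|li)>', '\n\n', text)
    # Last resort: strip every tag that is still left.
    text = re.sub(r'(?s)<[^>]+>', '', text)
    # Turn entities such as &amp; back into characters.
    text = unescape(text)
    # Tidy up whitespace.
    text = re.sub(r'[ \t]+', ' ', text)
    text = re.sub(r' *\n *', '\n', text)
    text = re.sub(r'\n{3,}', '\n\n', text).strip()
    # Footnotes with the collected URLs go at the bottom of the text.
    if footnotes:
        text += '\n\n' + '\n'.join(
            '[%d] %s' % (number, url)
            for number, url in enumerate(footnotes, 1))
    return text
```

Regular expressions are a fragile way to handle real-world HTML, so treat this only as a picture of the steps: footnote the links, mark up the bold text, fix the linebreaks, strip the rest, unescape.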
Version at the time of writing this: 0.1.
I wouldn't be surprised if I've reinvented the wheel here but I did plenty of searches and couldn't really find anything like this.
I'll run this for a while until I stumble across bugs or other inconsistencies, which hasn't happened yet. The one thing I'm really unhappy about is the way I extract the body from the BeautifulSoup parse object. I couldn't find a better way in the few minutes I had to spare on this.
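For comparison, here is one possible way to pull out the inside of <body> (or of an element with a given ID) without going through BeautifulSoup at all, using the standard library's HTMLParser. This is a swapped-in technique, not what the script does, and the names FragmentExtractor and extract_fragment are hypothetical. It assumes reasonably well-formed markup where every non-void tag is closed.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class FragmentExtractor(HTMLParser):
    """Collect the inner HTML of <body>, or of the element with a given id."""

    def __init__(self, element_id=None):
        # Keep entities as written so the fragment is still valid HTML.
        super().__init__(convert_charrefs=False)
        self.element_id = element_id
        self.depth = 0      # 0 means we are outside the target element
        self.done = False   # set once the target element has been closed
        self.parts = []

    def _is_target(self, tag, attrs):
        if self.element_id is not None:
            return dict(attrs).get("id") == self.element_id
        return tag == "body"

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag not in VOID_TAGS:
                self.depth += 1
        elif not self.done and tag not in VOID_TAGS and self._is_target(tag, attrs):
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied but never nest.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True
        else:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth:
            self.parts.append("&#%s;" % name)


def extract_fragment(html, element_id=None):
    """Return the inner HTML of <body>, or of the element with element_id."""
    parser = FragmentExtractor(element_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Calling extract_fragment(html) returns what is inside <body>, and extract_fragment(html, "main") returns what is inside the element whose id is "main". Unlike BeautifulSoup, this does nothing to repair broken markup, so it is a trade of robustness for having no dependency.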
Feel free to comment on things you think are pressing bugs.
You can download the script here: html2plaintext.py, version 0.1.
I should take a second look at Aaron Swartz's html2text.py script the next time I work on this. His script seems a lot more mature, and Aaron is a brilliant Python developer.