html2plaintext Python script to convert HTML emails to plain text

10 August 2007   12 comments   Python


From the doc string:

A very spartan attempt at a script that converts HTML to plaintext.

The original use for this little script was that when I sent HTML emails out, I
also wanted to send a plaintext version of the HTML email as multipart. Instead of
having two methods for generating the text I decided to focus on the HTML part
first and foremost (considering that a large majority of people don't have a
problem with HTML emails) and have the fallback (plaintext) created on the fly.
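
To make the multipart idea concrete, here is a minimal sketch of an email that carries both versions. It uses email.message.EmailMessage from today's Python standard library, which is an assumption on my part and not necessarily what the original setup used; the function name and the addresses are made up for illustration.

```python
from email.message import EmailMessage


def build_multipart_email(subject, sender, recipient, html_body, text_body):
    """Build a multipart/alternative message (hypothetical helper)."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    # The plaintext fallback is set first...
    msg.set_content(text_body)
    # ...and the HTML version is added as the alternative part after it.
    msg.add_alternative(html_body, subtype="html")
    return msg


msg = build_multipart_email(
    "Hello",
    "sender@example.com",      # placeholder address
    "recipient@example.com",   # placeholder address
    "<html><body><p>Hello <strong>world</strong></p></body></html>",
    "Hello *world*",
)
print(msg.get_content_type())
```

In this arrangement the text_body argument is where the output of an HTML-to-plaintext conversion would go, so only the HTML has to be written by hand.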

This little script takes a chunk of HTML and strips out everything except the
<body> (or an element ID) and inside that chunk it makes certain conversions
such as replacing all hyperlinks with footnotes where the URL is shown at the
bottom of the text instead. <strong>words</strong> are converted to *words*
and it makes a fair attempt at getting the linebreaks right.

As a last resort, it strips away all other tags left that couldn't be gracefully
replaced with a plaintext equivalent.
Thanks to Fredrik Lundh's unescape() function, things like:
   'Terms &amp; Conditions' are converted to
   'Terms & Conditions'

It's far from perfect but a good start. It works for me for now.
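
For readers who want a feel for the conversions the doc string describes, here is a rough sketch and not the actual script. It swaps in plain regular expressions where the script uses BeautifulSoup, and the standard library's html.unescape() where the script uses Fredrik Lundh's unescape(). The function name and the "[1]" footnote format are assumptions, and regexes like these will not cope with nested or malformed HTML.

```python
import html
import re

# Deliberately simple patterns; good enough for tidy, hand-written markup only.
LINK_RE = re.compile(r'<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>', re.I | re.S)
STRONG_RE = re.compile(r"<(strong|b)>(.*?)</\1>", re.I | re.S)
TAG_RE = re.compile(r"<[^>]+>")


def html_to_text_sketch(fragment):
    """Convert an HTML fragment to plaintext with footnoted links."""
    urls = []

    def link_to_footnote(match):
        # Remember the URL and leave a numbered marker after the link text.
        urls.append(match.group(1))
        return "%s [%d]" % (match.group(2), len(urls))

    text = LINK_RE.sub(link_to_footnote, fragment)
    # <strong>words</strong> (or <b>) becomes *words*.
    text = STRONG_RE.sub(r"*\2*", text)
    # Last resort: drop whatever tags are left.
    text = TAG_RE.sub("", text)
    # Turn entities such as &amp; back into characters.
    text = html.unescape(text).strip()
    if urls:
        notes = "\n".join("[%d] %s" % (i, url) for i, url in enumerate(urls, 1))
        text = "%s\n\n%s" % (text, notes)
    return text


sample = (
    '<p>Read the <a href="http://example.com/terms">Terms &amp; Conditions</a> '
    "<strong>now</strong>.</p>"
)
print(html_to_text_sketch(sample))
```

This sketch skips the body extraction and the linebreak handling entirely; it only shows the footnote, emphasis, tag stripping and unescaping steps.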

Version at the time of writing this: 0.1.

I wouldn't be surprised if I've reinvented the wheel here but I did plenty of searches and couldn't really find anything like this.

Let's run this for a while until I stumble across some bugs or other inconsistencies, which I haven't yet. The one thing I'm really unhappy about is the way I extract the body from the BeautifulSoup parse object. I really couldn't find a better way in the few minutes I had to spare on this.

Feel free to comment on things you think are pressing bugs.

You can download the script here (version 0.1).


I should take a second look at Aaron Swartz's script the next time I work on this. His script seems a lot more mature and Aaron is a brilliant Python developer.


James has a Python program to do much the same.
Peter Bengtsson
So I did reinvent the wheel like I suspected. Pity. I'll definitely keep this one in mind once I've tested mine for a while.
Cool, thank you! I was looking for something exactly like this a couple of weeks ago!

Your counting for the footnote anchors is inconsistent by the way, as you count both inline and indexed anchors in the first loop. Here's a quick fix, albeit I am sure there is a better one:

Peter Bengtsson
Thanks man! Incorporated now.
Ian Bicking
I wrote something like this some time ago:

It uses HTMLParser, which is kind of crappy, but BS didn't exist at the time.
Peter Bengtsson
I seem to be getting excess linebreaks with that but I'm sure there's a solution to that too. Your script doesn't use footnotes but <url/in/angle/brackets> instead. I'll think about that because that's also a respected "email format".
Ian Bicking
I've never noticed excess linebreaks. I realize sometimes I ignore \r, though; it's possible I eliminate meaningless \n and not \r.

We used lynx originally before I wrote this, but people really didn't like the output and of course there's nothing you can do to control it. For actual emailing we'd render the email template twice, once with text=True and once with text=False, and then you can tweak things in the template however you want (e.g., leave out some navigation from the text version).
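
The render-twice idea could look something like the following sketch. The render_email() function, its wording and its markup are entirely hypothetical stand-ins for a real template; the only thing taken from the comment is the text=True/text=False switch.

```python
import html


def render_email(name, text=False):
    """Hypothetical template that produces either flavour of the same email."""
    if text:
        # Plaintext flavour: no markup and no navigation block.
        return "Hi %s,\n\nYour account is ready.\n" % name
    # HTML flavour: includes navigation and escapes the inserted value.
    return (
        "<html><body>"
        '<div id="nav"><a href="/">Home</a></div>'
        "<p>Hi %s,</p><p>Your account is ready.</p>"
        "</body></html>"
    ) % html.escape(name)


text_part = render_email("Peter", text=True)
html_part = render_email("Peter", text=False)
```

The trade-off against converting the HTML afterwards is that each version can be tuned by hand, at the cost of keeping both branches of the template in sync.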
Another link
Also, w3m on linux (and probably lynx and links2) will do the same.
Peter Bengtsson
That's a rather different result but definitely an interesting recipe. Thanks.
lynx -dump -force_html
w3m -dump
Nothing from elinks or links2, last time I checked, but both the above will do a better job of formatting HTML to plaintext than what you've just hacked up. (Still, not too bad.)
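
For anyone wanting to try the text browsers from Python, here is a small sketch that pipes HTML into them. The -stdin flag for lynx and the -T text/html flag for w3m are additions of mine to the commands above so that both read from standard input; check the man pages on your system. The function name is made up, and it simply returns None when the browser isn't installed.

```python
import shutil
import subprocess

# Command lines based on the ones above, plus flags for reading stdin (assumed).
COMMANDS = {
    "lynx": ["lynx", "-dump", "-force_html", "-stdin"],
    "w3m": ["w3m", "-dump", "-T", "text/html"],
}


def dump_with_browser(html_text, browser="lynx"):
    """Return the browser's plaintext rendering, or None if it isn't installed."""
    command = COMMANDS[browser]
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(
        command, input=html_text, capture_output=True, text=True, check=True
    )
    return result.stdout


output = dump_with_browser("<p>Hello <strong>world</strong></p>", browser="w3m")
print(output if output is not None else "w3m not found")
```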
Peter Bengtsson
w3m doesn't do the hyperlink footnotes.
Hardik Patadia

I have a text containing "<" and ">" symbols, and when it is displayed they get converted to "&lt;" and "&gt;". Is there anything that lets the given text remain as it is, i.e. the symbols should remain as "<" and ">"? I am working on OpenERP/Odoo, which uses Python as its scripting language.

