From the doc string:

A very spartan attempt at a script that converts HTML to
plaintext.

The original use for this little script: when I send out HTML emails I also
want to send a plaintext version of the same email as multipart. Instead of
maintaining two methods for generating the text, I decided to focus on the HTML
part first and foremost (considering that the large majority of people don't
have a problem with HTML emails) and have the plaintext fallback created on the
fly.
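
For context, a multipart email with both versions can be assembled with the standard library's email package. This is only a minimal sketch of the idea, not code from the script; the function name build_message and its parameters are hypothetical.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_message(html, plaintext, subject, sender, recipient):
    """Build a multipart/alternative message (hypothetical helper)."""
    msg = MIMEMultipart("alternative")
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    # Attach the plaintext fallback first and the HTML last:
    # mail clients prefer the last alternative they can render.
    msg.attach(MIMEText(plaintext, "plain", "utf-8"))
    msg.attach(MIMEText(html, "html", "utf-8"))
    return msg
```

The plaintext argument is where the output of an HTML-to-plaintext conversion would go.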

This little script takes a chunk of HTML and strips out everything except the
<body> (or the element with a given ID). Inside that chunk it makes certain
conversions, such as replacing all hyperlinks with footnotes where the URL is
shown at the bottom of the text instead. <strong>words</strong> is converted to
*words*, and it makes a fair attempt at getting the linebreaks right.
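
To illustrate the conversions described above, here is a minimal sketch that uses only regular expressions from the standard library. It is not the code from html2plaintext.py (which works on a BeautifulSoup parse); the function name links_to_footnotes and the "[n]" footnote format are assumptions made for illustration.

```python
import re


def links_to_footnotes(html):
    """Turn links into numbered footnotes and <strong> into *emphasis*.

    Hypothetical helper; the footnote layout is an assumption.
    """
    urls = []

    def replace_link(match):
        url, text = match.group(1), match.group(2)
        if url not in urls:
            urls.append(url)  # a repeated URL reuses its footnote number
        return "%s [%d]" % (text, urls.index(url) + 1)

    text = re.sub(r'<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
                  replace_link, html, flags=re.I | re.S)
    # <strong>words</strong> (or <b>) becomes *words*
    text = re.sub(r'</?(?:strong|b)>', '*', text, flags=re.I)
    # Last resort: drop any tags that are still left
    text = re.sub(r'<[^>]+>', '', text)
    if urls:
        footnotes = "\n".join("[%d] %s" % (i, u)
                              for i, u in enumerate(urls, 1))
        text = text.rstrip() + "\n\n" + footnotes
    return text
```

Regular expressions are fragile on real-world HTML, so a proper parser is the safer choice outside of a sketch like this.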

As a last resort, it strips away all remaining tags that couldn't be gracefully
replaced with a plaintext equivalent.
Thanks to Fredrik Lundh's unescape() function, things like:
   'Terms &amp;amp; Conditions' are converted to
   'Terms &amp; Conditions'
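
In current Python the standard library's html.unescape() does a comparable job. The sketch below uses it as a stand-in for Fredrik Lundh's function, which is not reproduced here; the wrapper name unescape_entities is hypothetical.

```python
import html


def unescape_entities(text):
    """Resolve one level of HTML entities (stand-in for unescape())."""
    return html.unescape(text)


# The example from the text above: one level of escaping is removed.
once = unescape_entities("Terms &amp;amp; Conditions")
# Applying it again resolves the remaining entity.
twice = unescape_entities(once)
```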

It's far from perfect but a good start. It works for me for now.

Version at the time of writing this: 0.1.

I wouldn't be surprised if I've reinvented the wheel here but I did plenty of searches and couldn't really find anything like this.

I'll run this for a while until I stumble across bugs or other inconsistencies, which hasn't happened yet. The one thing I'm really unhappy about is the way I extract the body from the BeautifulSoup parse object; I couldn't find a better way in the few minutes I had to spare on this.

Feel free to comment on things you think are pressing bugs.

You can download the script here: html2plaintext.py version 0.1

UPDATE

I should take a second look at Aaron Swartz's html2text.py script the next time I work on this. His script seems a lot more mature, and Aaron is a brilliant Python developer.

James - 10 August 2007
http://www.aaronsw.com/2002/html2text/ has a Python program to do much the same.
Peter Bengtsson - 11 August 2007
So I did reinvent the wheel like I suspected. Pity. I'll definitely keep this one in mind once I've tested mine for a while.
Philipp - 10 August 2007
Cool, thank you! I was looking for something exactly like this a couple of weeks ago!

Your counting for the footnote anchors is inconsistent by the way, as you count both inline and indexed anchors in the first loop. Here's a quick fix, albeit I am sure there is a better one:
http://dpaste.com/16572/

Cheers,
Philipp
Peter Bengtsson - 11 August 2007
Thanks man! Incorporated now.
Ian Bicking - 10 August 2007
I wrote something like this some time ago:

http://svn.w4py.org/ZPTKit/trunk/ZPTKit/htmlrender.py

It uses HTMLParser, which is kind of crappy, but BS didn't exist at the time.
Peter Bengtsson - 11 August 2007
I seem to be getting excess linebreaks with that but I'm sure there's a solution to that too. Your script doesn't use footnotes but <url/in/angle/brackets> instead. I'll think about that because that's also a respected "email format".
Ian Bicking - 11 August 2007
I've never noticed excess linebreaks. I realize sometimes I ignore \r, though; it's possible I eliminate meaningless \n and not \r.

We used lynx originally before I wrote this, but people really didn't like the output and of course there's nothing you can do to control it. For actual emailing we'd render the email template twice, once with text=True and once with text=False, and then you can tweak things in the template however you want (e.g., leave out some navigation from the text version).
DW - 10 August 2007
Another link
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52297
Also, w3m on Linux (and probably lynx and links2) will do the same.
Peter Bengtsson - 11 August 2007
That's a rather different result but definitely an interesting recipe. Thanks.
ephemient - 11 August 2007
lynx -dump -force_html
w3m -dump
Nothing from elinks or links2, last time I checked, but both the above will do a better job of formatting HTML to plaintext than what you've just hacked up. (Still, not too bad.)
Peter Bengtsson - 11 August 2007
w3m doesn't do the hyperlink footnotes.

