10 August 2007 12 comments Python
From the doc string:
A very spartan attempt at a script that converts HTML to plaintext. The original use for this little script was that when I send HTML emails out I also want to send a plaintext version of the HTML email as multipart. Instead of having two methods for generating the text I decided to focus on the HTML part first and foremost (considering that a large majority of people don't have a problem with HTML emails) and have the fallback (plaintext) created on the fly. This little script takes a chunk of HTML and strips out everything except the <body> (or an element ID) and inside that chunk it makes certain conversions such as replacing all hyperlinks with footnotes where the URL is shown at the bottom of the text instead. <strong>words</strong> are converted to *words* and it makes a fair attempt at getting the linebreaks right. As a last resort, it strips away all other tags left that couldn't be gracefully replaced with a plaintext equivalent. Thanks to Fredrik Lundh's unescape() function, things like 'Terms &amp; Conditions' are converted to 'Terms & Conditions'. It's far from perfect but a good start. It works for me for now.
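To make the steps in the doc string concrete, here is a minimal sketch of the same pipeline. It is not the code in html2plaintext.py: the real script uses BeautifulSoup and Fredrik Lundh's unescape(), whereas this sketch assumes nothing beyond the standard library (regular expressions and html.unescape from modern Python), and the function name html2plaintext_sketch is made up for illustration. Regular expressions are a fragile way to handle HTML, so treat it as a demonstration of the idea only.

```python
import re
from html import unescape


def html2plaintext_sketch(html):
    """Rough HTML-to-plaintext conversion with hyperlink footnotes.

    Hypothetical helper for illustration, not the real html2plaintext.py.
    """
    # Keep only what is inside <body>, if there is one.
    body = re.search(r"<body[^>]*>(.*?)</body>", html, re.I | re.S)
    text = body.group(1) if body else html

    # Replace each hyperlink with its label plus a numbered marker,
    # and remember the URL for the footnote list.
    urls = []

    def link_to_footnote(match):
        urls.append(match.group(1))
        return "%s [%d]" % (match.group(2), len(urls))

    text = re.sub(r'<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
                  link_to_footnote, text, flags=re.I | re.S)

    # <strong>words</strong> and <b>words</b> become *words*.
    text = re.sub(r"<(strong|b)>(.*?)</\1>", r"*\2*", text, flags=re.I | re.S)

    # Approximate line breaks: <br> is one newline, </p> ends a paragraph.
    text = re.sub(r"<br\s*/?>", "\n", text, flags=re.I)
    text = re.sub(r"</p\s*>", "\n\n", text, flags=re.I)

    # Last resort: strip every remaining tag, then decode entities.
    text = unescape(re.sub(r"<[^>]+>", "", text))

    # Collapse runs of blank lines and append the footnotes.
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    if urls:
        notes = "\n".join("[%d] %s" % (i, url) for i, url in enumerate(urls, 1))
        text += "\n\n" + notes
    return text


sample = ('<html><body><p>Read the <a href="http://example.com/terms">'
          'Terms &amp; Conditions</a> <strong>now</strong>.</p></body></html>')
result = html2plaintext_sketch(sample)
print(result)
```

Each link is numbered at the moment it is replaced, so the inline markers and the footnote list cannot drift apart.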
Version at the time of writing this: 0.1.
I wouldn't be surprised if I've reinvented the wheel here but I did plenty of searches and couldn't really find anything like this.
Let's run this for a while until I stumble across some bugs or other inconsistencies which I haven't quite found yet. The one thing I'm really unhappy about is the way I extract the body from the BeautifulSoup parse object. I really couldn't find a better way in the few minutes I had to spare on this.
Feel free to comment on things you think are pressing bugs.
You can download the script here: html2plaintext.py version 0.1
I should take a second look at Aaron Swartz's html2text.py script the next time I work on this. His script seems a lot more mature and Aaron is a brilliant Python developer.
http://www.aaronsw.com/2002/html2text/ has a Python program to do much the same.
So I did reinvent the wheel like I suspected. Pity. I'll definitely keep this one in mind once I've tested mine for a while.
Cool, thank you! I was looking for something exactly like this a couple of weeks ago!
Your counting for the footnote anchors is inconsistent by the way, as you count both inline and indexed anchors in the first loop. Here's a quick fix, albeit I am sure there is a better one:
Thanks man! Incorporated now.
I wrote something like this some time ago:
It uses HTMLParser, which is kind of crappy, but BS didn't exist at the time.
I seem to be getting excess linebreaks with that but I'm sure there's a solution to that too. Your script doesn't use footnotes but <url/in/angle/brackets> instead. I'll think about that because that's also a respected "email format".
I've never noticed excess linebreaks. I realize sometimes I ignore \r, though; it's possible I eliminate meaningless \n and not \r.
We used lynx originally before I wrote this, but people really didn't like the output and of course there's nothing you can do to control it. For actual emailing we'd render the email template twice, once with text=True and once with text=False, and then you can tweak things in the template however you want (e.g., leave out some navigation from the text version).
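A rough sketch of that render-twice approach, using the standard library's email package. The render_template function here is a hypothetical stand-in for whatever template engine is actually in use, and the addresses and subject are placeholders; only the text=True / text=False switch comes from the comment above.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def render_template(name, text=False):
    """Hypothetical stand-in for a real template engine."""
    if text:
        # The text version can leave out navigation and other HTML-only parts.
        return "Hello %s,\n\nYour order has shipped." % name
    return ("<html><body><p>Hello <strong>%s</strong>,</p>"
            "<p>Your order has shipped.</p></body></html>" % name)


def build_message(name, sender, recipient, subject):
    """Build a multipart/alternative email from two renderings."""
    msg = MIMEMultipart("alternative")
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    # Plain text first, HTML last: mail clients are expected to prefer
    # the last alternative they are able to display.
    msg.attach(MIMEText(render_template(name, text=True), "plain", "utf-8"))
    msg.attach(MIMEText(render_template(name, text=False), "html", "utf-8"))
    return msg


message = build_message("Alice", "shop@example.com",
                        "alice@example.com", "Your order")
print(message.as_string())
```

The trade-off against converting HTML on the fly is that the template has to carry both variants, but in return the text version is fully under your control.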
Also, w3m on linux (and probably lynx and links2) will do the same.
That's kind of different result but definitely an interesting recipe. Thanks.
lynx -dump -force_html
Nothing from elinks or links2, last time I checked, but both the above will do a better job of formatting HTML to plaintext than what you've just hacked up. (Still, not too bad.)
w3m doesn't do the hyperlink footnotes.
I have a text containing "<" and ">" symbols, and when it is displayed they get converted to "&lt;" and "&gt;". Is there anything that lets the given text remain as it is, i.e. the symbols should stay as "<" and ">"? I am working on OpenERP/Odoo, which uses Python as its scripting language.
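If the symbols are being turned into HTML entities somewhere along the way, one possible approach in plain Python is to decode them again with the standard library. This is only a sketch of that general technique; it assumes the text has been entity-escaped and says nothing about how OpenERP/Odoo itself renders fields.

```python
from html import escape, unescape

raw = 'if a < b and b > c: print("ok")'

# Escaping turns the angle brackets into entities, as an HTML layer would.
escaped = escape(raw, quote=False)
print(escaped)

# Unescaping restores the original characters.
restored = unescape(escaped)
print(restored)
```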