From the doc string:
A very spartan attempt of a script that converts HTML to
plaintext.
The original use for this little script was when I send HTML emails out I also
wanted to send a plaintext version of the HTML email as multipart. Instead of
having two methods for generating the text I decided to focus on the HTML part
first and foremost (considering that a large majority of people don't have a
problem with HTML emails) and make the fallback (plaintext) created on the fly.
This little script takes a chunk of HTML and strips out everything except the
<body> (or an element ID) and inside that chunk it makes certain conversions
such as replacing all hyperlinks with footnotes where the URL is shown at the
bottom of the text instead. <strong>words</strong> are converted to *words*
and it does a fair attempt of getting the linebreaks right.
As a last resort, it strips away all other tags left that couldn't be gracefully
replaced with a plaintext equivalent.
Thanks to Fredrik Lundh's unescape() function, things like:
'Terms &amp; Conditions' are converted to
'Terms & Conditions'
It's far from perfect but a good start. It works for me for now.
Version at the time of writing this: 0.1.
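To make the conversions described in the doc string concrete, here is a minimal sketch. It is not the actual html2plaintext.py: it uses regular expressions instead of BeautifulSoup, the standard library's html.unescape() stands in for Fredrik Lundh's unescape(), and the function name html_to_plaintext is made up for this illustration. It also only handles links whose href is in double quotes.

```python
import re
from html import unescape


def html_to_plaintext(html):
    """Rough HTML-to-text conversion with hyperlink footnotes (illustration only)."""
    # Keep only what is inside <body>, if there is one.
    body = re.search(r"<body[^>]*>(.*?)</body>", html, re.I | re.S)
    text = body.group(1) if body else html

    footnotes = []

    def link_to_footnote(match):
        # Remember the URL and leave a numbered marker after the link text.
        footnotes.append(match.group(1))
        return "%s [%d]" % (match.group(2), len(footnotes))

    text = re.sub(r'<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
                  link_to_footnote, text, flags=re.I | re.S)
    # <strong>words</strong> becomes *words*
    text = re.sub(r"<strong[^>]*>(.*?)</strong>", r"*\1*", text,
                  flags=re.I | re.S)
    # Line breaks: <br> is one newline, a closing </p> is a blank line.
    text = re.sub(r"<br\s*/?>", "\n", text, flags=re.I)
    text = re.sub(r"</p\s*>", "\n\n", text, flags=re.I)
    # Last resort: drop every tag that is still left.
    text = re.sub(r"<[^>]+>", "", text)
    # Turn entities such as &amp; back into plain characters.
    text = unescape(text).strip()

    if footnotes:
        # List the collected URLs at the bottom of the text.
        text += "\n\n" + "\n".join(
            "[%d] %s" % (i, url) for i, url in enumerate(footnotes, 1))
    return text


sample = ('<html><body><p>Read the <a href="http://example.com/terms">'
          'Terms &amp; Conditions</a> <strong>now</strong>.</p></body></html>')
print(html_to_plaintext(sample))
```

Tags are stripped before the entities are unescaped, so an escaped &lt; in the text is not mistaken for the start of a tag.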
I wouldn't be surprised if I've reinvented the wheel here but I did plenty of searches and couldn't really find anything like this.
Let's run this for a while until I stumble across some bugs or other inconsistencies which I haven't quite done yet. The one thing I'm really unhappy about is the way I extract the body from the BeautifulSoup parse object. I really couldn't find another better way in the few minutes I had to spare on this.
Feel free to comment on things you think are pressing bugs.
You can download the script here: html2plaintext.py version 0.1
UPDATE
I should take a second look at Aaron Swartz's html2text.py script the next time I work on this. His script seems a lot more mature and Aaron is a brilliant Python developer.
Comments
http://www.aaronsw.com/2002/html2text/ has a Python program to do much the same.
So I did reinvent the wheel like I suspected. Pity. I'll definitely keep this one in mind once I've tested mine for a while.
Cool, thank you! I was looking for something exactly like this a couple of weeks ago!
Your counting for the footnote anchors is inconsistent by the way, as you count both inline and indexed anchors in the first loop. Here's a quick fix, albeit I am sure there is a better one:
http://dpaste.com/16572/
Cheers,
Philipp
Thanks man! Incorporated now.
I wrote something like this some time ago:
http://svn.w4py.org/ZPTKit/trunk/ZPTKit/htmlrender.py
It uses HTMLParser, which is kind of crappy, but BS didn't exist at the time.
I seem to be getting excess linebreaks with that but I'm sure there's a solution to that too. Your script doesn't use footnotes but <url/in/angle/brackets> instead. I'll think about that because that's also a respected "email format".
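For comparison, the angle-bracket style can be sketched in a couple of lines. This is only an illustration of the output format, not code from htmlrender.py or from my script; the function name links_to_angle_brackets is made up, and it assumes the href is in double quotes.

```python
import re


def links_to_angle_brackets(html):
    """Replace <a href="URL">text</a> with: text <URL>"""
    return re.sub(r'<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
                  r"\2 <\1>", html, flags=re.I | re.S)


print(links_to_angle_brackets('See <a href="http://example.com/">the site</a>.'))
```

The URL stays right next to the words it belongs to, which reads well for one or two links but gets noisy in link-heavy text. That is the trade-off against footnotes.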
I've never noticed excess linebreaks. I realize sometimes I ignore \r, though; it's possible I eliminate meaningless \n and not \r.
We used lynx originally before I wrote this, but people really didn't like the output and of course there's nothing you can do to control it. For actual emailing we'd render the email template twice, once with text=True and once with text=False, and then you can tweak things in the template however you want (e.g., leave out some navigation from the text version).
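Whichever way the two versions are produced, they end up in one multipart email. A minimal sketch using Python's standard email package follows; the function name build_message and its parameters are assumptions for this example, and the template rendering itself is left out.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_message(subject, html_body, text_body):
    """Build a multipart/alternative email with text and HTML parts."""
    # "alternative" says the parts are the same content in different
    # formats; a mail client shows the last one it can handle.
    msg = MIMEMultipart("alternative")
    msg["Subject"] = subject
    # Plain text first, HTML last, so HTML-capable clients prefer the HTML.
    msg.attach(MIMEText(text_body, "plain", "utf-8"))
    msg.attach(MIMEText(html_body, "html", "utf-8"))
    return msg


msg = build_message("Hello", "<p><strong>Hi</strong> there</p>", "*Hi* there")
print(msg.get_content_type())
```

The text_body argument can come from a second template render or from a converter run on the HTML; the message structure is the same either way.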
Another link
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52297
Also, w3m on linux (and probably lynx and links2) will do the same.
That's kind of a different result but definitely an interesting recipe. Thanks.
lynx -dump -force_html
w3m -dump
Nothing from elinks or links2, last time I checked, but both the above will do a better job of formatting HTML to plaintext than what you've just hacked up. (Still, not too bad.)
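If you want to try those two commands from Python, something like the sketch below should work. The extra flags that make each program read the HTML from standard input (-stdin for lynx, -T text/html for w3m) are my additions and not part of the commands quoted above, and the name dump_with_browser is made up for this example.

```python
import shutil
import subprocess

# Command lines for the two browsers mentioned above. The extra flags
# (an assumption of this sketch) make them read HTML from stdin.
DUMP_COMMANDS = {
    "lynx": ["lynx", "-dump", "-force_html", "-stdin"],
    "w3m": ["w3m", "-dump", "-T", "text/html"],
}


def dump_with_browser(html, browser="lynx"):
    """Return the browser's plaintext rendering, or None if it isn't installed."""
    command = DUMP_COMMANDS[browser]
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(command, input=html, capture_output=True,
                            text=True, check=True)
    return result.stdout
```

Calling dump_with_browser("<p>Hello</p>", "w3m") returns w3m's rendering as a string when w3m is on the PATH, and None otherwise.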
w3m doesn't do the hyperlink footnotes.
Hi,
I have text containing "<" and ">" symbols, and when it is displayed they get converted to "<" and ">". Is there any way to keep the given text as it is, i.e. so that the symbols remain as "<" and ">"? I am working on OpenERP/Odoo, which uses Python as its scripting language.