tl;dr: selectolax is best for stripping HTML down to plain text.
The problem is that I have 10,000+ HTML snippets that I need to index into Elasticsearch as plain text. (Before you ask: yes, I know Elasticsearch has an html_strip character filter, but it's not what I want/need to use in this context.)
Turns out, stripping the HTML into plain text was actually quite expensive at that scale. So what's the most performant way?
pyquery

from pyquery import PyQuery as pq

text = pq(html).text()
selectolax

from selectolax.parser import HTMLParser

text = HTMLParser(html).text()
regular expression

import re

clean_regex = re.compile(r'<.*?>')
text = clean_regex.sub('', html)
Results
I wrote a script that iterated through 10,000 files that contain HTML snippets. Note! The snippets aren't complete <html> documents (with a <head> and <body> etc.), just blobs of HTML. The average size is 10,314 bytes (5,138 bytes median).
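The benchmark script itself isn't shown here, but it was essentially shaped like this (a minimal sketch; the snippets/ directory and the benchmark helper are my own stand-ins, not the original script):

import statistics
import time
from pathlib import Path

from selectolax.parser import HTMLParser

def benchmark(strip_func, files):
    # Time strip_func(html) over every file; report in the same units as below.
    timings = []
    for path in files:
        html_snippet = path.read_text()
        start = time.perf_counter()
        strip_func(html_snippet)
        timings.append((time.perf_counter() - start) * 1000)  # milliseconds
    print(f'SUM: {sum(timings) / 1000:.2f} seconds')
    print(f'MEAN: {statistics.mean(timings):.4f} ms')
    print(f'MEDIAN: {statistics.median(timings):.4f} ms')

files = list(Path('snippets').glob('*.html'))  # hypothetical location of the 10,000 files
benchmark(lambda html_snippet: HTMLParser(html_snippet).text(), files)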
pyquery     SUM: 18.61 seconds   MEAN: 1.8633 ms   MEDIAN: 1.0554 ms
selectolax  SUM:  3.08 seconds   MEAN: 0.3149 ms   MEDIAN: 0.1621 ms
regex       SUM:  1.64 seconds   MEAN: 0.1613 ms   MEDIAN: 0.0881 ms
I've run it a bunch of times. The results are pretty stable.
Point is: selectolax is ~6 times faster than PyQuery.
Regex? Really?
No, I don't think I want to use that. It makes me nervous without even attempting to dig up examples where it goes wrong. It might work just fine for the most basic blobs of HTML. Actually, if the HTML is <p>Foo &amp; Bar</p>, I expect the plain text transformation to be Foo & Bar, not Foo &amp; Bar.
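To make that concrete, here's a quick check of my own (not part of the benchmark script):

import re

from selectolax.parser import HTMLParser

snippet = '<p>Foo &amp; Bar</p>'

# The regex only removes the tags; the character entity stays encoded.
print(re.sub(r'<.*?>', '', snippet))  # -> Foo &amp; Bar

# A real HTML parser decodes entities as part of parsing.
print(HTMLParser(snippet).text())     # -> Foo & Bar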
More pressing, both PyQuery and selectolax support something very specific but important to my use case: I need to remove certain tags (and their content) before I proceed. For example:
<h4 class="warning">This should get stripped.</h4>
<p>Please keep.</p>
<div style="display: none">This should also get stripped.</div>
That can never be done with a regex.
Version 2.0
So my requirement will probably change, but basically I want to delete certain tags. E.g. <div class="warning">, <div class="hidden">, and <div style="display: none">. So let's implement that:
pyquery

import re
from pyquery import PyQuery as pq

_display_none_regex = re.compile(r'display:\s*none')

doc = pq(html)
doc.remove('div.warning, div.hidden')
for div in doc('div[style]').items():
    style_value = div.attr('style')
    if _display_none_regex.search(style_value):
        div.remove()
text = doc.text()
selectolax

import re
from selectolax.parser import HTMLParser

_display_none_regex = re.compile(r'display:\s*none')

tree = HTMLParser(html)
for tag in tree.css('div.warning, div.hidden'):
    tag.decompose()
for tag in tree.css('div[style]'):
    style_value = tag.attributes['style']
    if style_value and _display_none_regex.search(style_value):
        tag.decompose()
text = tree.body.text()
This actually works. When I now run the same benchmark on 10,000 of these, the new results are:
pyquery     SUM: 21.70 seconds   MEAN: 2.1701 ms   MEDIAN: 1.3989 ms
selectolax  SUM:  3.59 seconds   MEAN: 0.3589 ms   MEDIAN: 0.2184 ms
regex       (skipped)
Again, selectolax beats PyQuery by a factor of ~6.
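For completeness, the selectolax version rolls up nicely into a reusable function. A minimal sketch (the name html_to_text is my own, not from the benchmark script):

import re
from selectolax.parser import HTMLParser

_display_none_regex = re.compile(r'display:\s*none')

def html_to_text(html):
    # Strip HTML to plain text, dropping warning/hidden/display:none divs first.
    tree = HTMLParser(html)
    for tag in tree.css('div.warning, div.hidden'):
        tag.decompose()
    for tag in tree.css('div[style]'):
        style_value = tag.attributes['style']
        if style_value and _display_none_regex.search(style_value):
            tag.decompose()
    return tree.body.text() if tree.body else ''

print(html_to_text('<div class="warning">Gone.</div><p>Please keep.</p>'))
# -> Please keep.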
Conclusion
Regular expressions are fast but weak in power. Makes sense.
selectolax is very impressive. I got the inspiration from this blog post, which sets out to do something very similar to what I'm doing.
I hope this helps someone. Thank you Artem Golubin of selectolax and @lexborisov for Modest, which selectolax is built upon.
Comments
Beg to disagree with your conclusion: "Regular expressions are ... weak in power"
import html
import re

html_text = '''<h4 class="warning">This should get stripped.</h4>
<p>Please keep.</p>
<p>Foo &amp; Bar</p>
<div style="display: none">This should also get stripped.</div>'''

warnings = '<.*(warning){1}.*>'
no_display = '<.*(display: none){1}.*>'
disp_warn = re.compile(rf'{no_display}|{warnings}')
html_tags = re.compile(r'<.*?>')
clean_txt = html.unescape(html_tags.sub('', disp_warn.sub('', html_text)))
Will give you:
Please keep.
Foo & Bar
But would you use it in a production application where the HTML isn't perfectly pure?
There's a reason for the famous Jamie Zawinski quote: 'Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.'
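For instance, something as small as a tag broken across a line boundary (a contrived example of mine) slips past both of those patterns, because . doesn't match newlines by default:

import re

messy = '<div\nstyle="display: none">This should get stripped.</div>'

disp_warn = re.compile(r'<.*(display: none){1}.*>')
html_tags = re.compile(r'<.*?>')

print(html_tags.sub('', disp_warn.sub('', messy)))
# -> <div
#    style="display: none">This should get stripped.
# The "hidden" text survives, along with half the tag.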