tl;dr: selectolax is best for stripping HTML down to plain text.
The problem is that I have 10,000+ HTML snippets that I need to index into Elasticsearch as plain text. (Before you ask: yes, I know Elasticsearch has an html_strip character filter, but it's not what I want/need to use in this context.)
Turns out, stripping the HTML into plain text was actually quite expensive at that scale. So what's the most performant way?
pyquery

from pyquery import PyQuery as pq

text = pq(html).text()
selectolax

from selectolax.parser import HTMLParser

text = HTMLParser(html).text()
regular expression

import re

clean_regex = re.compile(r'<.*?>')
text = clean_regex.sub('', html)
Results
I wrote a script that iterated through 10,000 files that contain HTML snippets. Note! The snippets aren't complete <html> documents (with a <head> and <body> etc.), just blobs of HTML. The average size is 10,314 bytes (5,138 bytes median).
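The benchmark script itself isn't shown here, but it was essentially shaped like this (a minimal sketch; the snippets/ directory and the benchmark helper are my own stand-ins, not the original script):

import statistics
import time
from pathlib import Path

from selectolax.parser import HTMLParser

def benchmark(strip_func, files):
    # Time strip_func(html) over every file; report in the same units as below.
    timings = []
    for path in files:
        html_snippet = path.read_text()
        start = time.perf_counter()
        strip_func(html_snippet)
        timings.append((time.perf_counter() - start) * 1000)  # milliseconds
    print(f'SUM: {sum(timings) / 1000:.2f} seconds')
    print(f'MEAN: {statistics.mean(timings):.4f} ms')
    print(f'MEDIAN: {statistics.median(timings):.4f} ms')

files = list(Path('snippets').glob('*.html'))  # hypothetical location of the 10,000 files
benchmark(lambda html_snippet: HTMLParser(html_snippet).text(), files)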
pyquery     SUM: 18.61 seconds   MEAN: 1.8633 ms   MEDIAN: 1.0554 ms
selectolax  SUM:  3.08 seconds   MEAN: 0.3149 ms   MEDIAN: 0.1621 ms
regex       SUM:  1.64 seconds   MEAN: 0.1613 ms   MEDIAN: 0.0881 ms
I've run it a bunch of times. The results are pretty stable.
Point is: selectolax is ~6 times faster than PyQuery.
Regex? Really?
No, I don't think I want to use that. It makes me nervous without even attempting to dig up examples where it goes wrong. It might work just fine for the most basic blobs of HTML. Actually, if the HTML is <p>Foo &amp; Bar</p>, I expect the plain text transformation to be Foo & Bar, not Foo &amp; Bar.
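To make that concrete, here's a quick check of my own (not part of the benchmark script):

import re

from selectolax.parser import HTMLParser

snippet = '<p>Foo &amp; Bar</p>'

# The regex only removes the tags; the character entity stays encoded.
print(re.sub(r'<.*?>', '', snippet))  # -> Foo &amp; Bar

# A real HTML parser decodes entities as part of parsing.
print(HTMLParser(snippet).text())     # -> Foo & Bar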
More pressing, both PyQuery and selectolax support something very specific but important to my use case: I need to remove certain tags (and their content) before I proceed. For example:
<h4 class="warning">This should get stripped.</h4>
<p>Please keep.</p>
<div style="display: none">This should also get stripped.</div>
That can never be done with a regex.
Version 2.0
So my requirement will probably change, but basically I want to delete certain tags. E.g. <div class="warning">, <div class="hidden">, and <div style="display: none">. So let's implement that:
pyquery

import re
from pyquery import PyQuery as pq

_display_none_regex = re.compile(r'display:\s*none')

doc = pq(html)
doc.remove('div.warning, div.hidden')
for div in doc('div[style]').items():
    style_value = div.attr('style')
    if _display_none_regex.search(style_value):
        div.remove()
text = doc.text()
selectolax

import re
from selectolax.parser import HTMLParser

_display_none_regex = re.compile(r'display:\s*none')

tree = HTMLParser(html)
for tag in tree.css('div.warning, div.hidden'):
    tag.decompose()
for tag in tree.css('div[style]'):
    style_value = tag.attributes['style']
    if style_value and _display_none_regex.search(style_value):
        tag.decompose()
text = tree.body.text()
This actually works. When I now run the same benchmark on 10,000 of these, the new results are:
pyquery     SUM: 21.70 seconds   MEAN: 2.1701 ms   MEDIAN: 1.3989 ms
selectolax  SUM:  3.59 seconds   MEAN: 0.3589 ms   MEDIAN: 0.2184 ms
regex       (skipped)
Again, selectolax beats PyQuery by a factor of ~6.
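For completeness, the selectolax version rolls up nicely into a reusable function. A minimal sketch (the name html_to_text is my own, not from the benchmark script):

import re
from selectolax.parser import HTMLParser

_display_none_regex = re.compile(r'display:\s*none')

def html_to_text(html):
    # Strip HTML to plain text, dropping warning/hidden/display:none divs first.
    tree = HTMLParser(html)
    for tag in tree.css('div.warning, div.hidden'):
        tag.decompose()
    for tag in tree.css('div[style]'):
        style_value = tag.attributes['style']
        if style_value and _display_none_regex.search(style_value):
            tag.decompose()
    return tree.body.text() if tree.body else ''

print(html_to_text('<div class="warning">Gone.</div><p>Please keep.</p>'))
# -> Please keep.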
Conclusion
Regular expressions are fast but weak in power. Makes sense.
selectolax is very impressive. I got the inspiration from this blog post, which sets out to do something very similar to what I'm doing.
I hope this helps someone. Thank you Artem Golubin of selectolax and @lexborisov for Modest, which selectolax is built upon.
Comments
Beg to disagree with your conclusion: "Regular expressions are ... weak in power"
import html
import re

html_text = '''<h4 class="warning">This should get stripped.</h4>
<p>Please keep.</p>
<p>Foo &amp; Bar</p>
<div style="display: none">This should also get stripped.</div>'''

warnings = '<.*(warning){1}.*>'
no_display = '<.*(display: none){1}.*>'
disp_warn = re.compile(rf'{no_display}|{warnings}')
html_tags = re.compile(r'<.*?>')
clean_txt = html.unescape(html_tags.sub('', disp_warn.sub('', html_text)))
Will give you:
Please keep.
Foo & Bar
But would you use it in a production application where the HTML isn't perfectly pure?
There's a reason for the famous Jamie Zawinski quote: 'Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.'
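For instance, something as small as a tag broken across a line boundary (a contrived example of mine) slips past both of those patterns, because . doesn't match newlines by default:

import re

messy = '<div\nstyle="display: none">This should get stripped.</div>'

disp_warn = re.compile(r'<.*(display: none){1}.*>')
html_tags = re.compile(r'<.*?>')

print(html_tags.sub('', disp_warn.sub('', messy)))
# -> <div
#    style="display: none">This should get stripped.
# The "hidden" text survives, along with half the tag.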