08 January 2021 Python
selectolax is best for stripping HTML down to plain text.
The problem is that I have 10,000+ HTML snippets that I need to index into Elasticsearch as plain text. (Before you ask, yes I know Elasticsearch has an html_strip character filter, but it's not what I want/need to use in this context.)
Turns out, stripping the HTML into plain text was actually quite expensive at that scale. So what's the most performant way?
PyQuery:

    from pyquery import PyQuery as pq

    text = pq(html).text()

selectolax:

    from selectolax.parser import HTMLParser

    text = HTMLParser(html).text()

regex:

    import re

    clean_regex = re.compile(r'<.*?>')
    text = clean_regex.sub('', html)
I wrote a script that iterated through 10,000 files that contain HTML snippets. Note! The snippets aren't complete <html> documents (with a <body> etc.), just blobs of HTML. The average size is 10,314 bytes (5,138 bytes median).
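The benchmark script itself isn't shown here, so the following is only a minimal sketch of how such a measurement could be made. The function names `strip_with_regex` and `benchmark`, and the generated stand-in snippets, are assumptions for illustration; the real run read 10,000 snippet files from disk and timed all three contenders.

```python
import re
import statistics
import time

# Same tag-stripping regex as the "regex" contender.
clean_regex = re.compile(r'<.*?>')


def strip_with_regex(html):
    return clean_regex.sub('', html)


def benchmark(func, snippets):
    """Time func on each snippet; return (sum in s, mean in ms, median in ms)."""
    timings_ms = []
    for html in snippets:
        t0 = time.perf_counter()
        func(html)
        t1 = time.perf_counter()
        timings_ms.append((t1 - t0) * 1000)
    return (
        sum(timings_ms) / 1000,
        statistics.mean(timings_ms),
        statistics.median(timings_ms),
    )


# Hypothetical stand-in data, not the real snippets.
snippets = ['<p>Hello <b>world</b> number %d</p>' % i for i in range(1000)]

total_s, mean_ms, median_ms = benchmark(strip_with_regex, snippets)
print('regex')
print('  SUM:    %.2f seconds' % total_s)
print('  MEAN:   %.4f ms' % mean_ms)
print('  MEDIAN: %.4f ms' % median_ms)
```

Passing the PyQuery or selectolax one-liner as `func` would give the other two rows. Timing each call separately is what makes the median available alongside the sum.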
pyquery
  SUM:    18.61 seconds
  MEAN:   1.8633 ms
  MEDIAN: 1.0554 ms
selectolax
  SUM:    3.08 seconds
  MEAN:   0.3149 ms
  MEDIAN: 0.1621 ms
regex
  SUM:    1.64 seconds
  MEAN:   0.1613 ms
  MEDIAN: 0.0881 ms
I've run it a bunch of times. The results are pretty stable.
selectolax is 6 to 7 times faster than PyQuery (depending on whether you compare means or medians), and the regex is about twice as fast as selectolax. So, the regex?
No, I don't think I want to use that. It makes me nervous without even attempting to dig up some examples where it goes wrong. It might work just fine for the most basic blobs of HTML. But if the HTML is <p>Foo &amp; Bar</p>, I expect the plain text transformation to be Foo & Bar, not Foo &amp; Bar.
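To make the worry concrete, here is a small sketch of the tag-stripping regex on two inputs. Calling `html.unescape` from the standard library is my own assumption about how one might patch the entity problem; it is not something the benchmark did, and it does nothing for the second case.

```python
import html
import re

clean_regex = re.compile(r'<.*?>')

# Case 1: entities are left encoded by the regex.
snippet = '<p>Foo &amp; Bar</p>'
stripped = clean_regex.sub('', snippet)
print(stripped)  # Foo &amp; Bar

# Assumed patch: decode entities after stripping the tags.
unescaped = html.unescape(stripped)
print(unescaped)  # Foo & Bar

# Case 2: a '>' inside an attribute value ends the match too early,
# so part of the attribute leaks into the text.
tricky = '<a title="a > b">link</a>'
leaked = clean_regex.sub('', tricky)
print(repr(leaked))  # ' b">link'
```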
More pressing, both PyQuery and selectolax support something very specific but important to my use case. I need to remove certain tags (and their content) before I proceed. For example:
    <h4 class="warning">This should get stripped.</h4>
    <p>Please keep.</p>
    <div style="display: none">This should also get stripped.</div>
That can never be done with a regex.
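A quick sketch of where a regex attempt breaks down: the pattern `hidden_div` below is a hypothetical, naive try at removing a hidden div together with its content. With a nested div inside, the lazy match stops at the first closing tag and the rest of the hidden text leaks through.

```python
import re

# Hypothetical naive pattern: drop a display:none div and its content.
hidden_div = re.compile(r'<div style="display: none">.*?</div>', re.DOTALL)
clean_regex = re.compile(r'<.*?>')

nested = (
    '<div style="display: none">secret <div>inner</div> also secret</div>'
    '<p>Please keep.</p>'
)

# The lazy .*? ends at the first </div>, which closes the inner div.
without_hidden = hidden_div.sub('', nested)
text = clean_regex.sub('', without_hidden)
print(repr(text))  # ' also secretPlease keep.'
```

Making the pattern greedy doesn't help either: with two hidden divs in one snippet it would swallow everything between them, including content that should be kept. Matching open and close tags correctly requires tracking nesting, which is what a parser does.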
My requirements will probably change, but basically I want to delete certain tags, e.g. <div class="warning">, <div class="hidden">, and <div style="display: none">. So let's implement that:
PyQuery:

    import re

    from pyquery import PyQuery as pq

    _display_none_regex = re.compile(r'display:\s*none')

    doc = pq(html)
    # Remove elements (and their content) by class.
    doc.remove('div.warning, div.hidden')
    # Remove divs that are hidden with an inline style.
    for div in doc('div[style]').items():
        style_value = div.attr('style')
        if _display_none_regex.search(style_value):
            div.remove()
    text = doc.text()

selectolax:

    import re

    from selectolax.parser import HTMLParser

    _display_none_regex = re.compile(r'display:\s*none')

    tree = HTMLParser(html)
    # Remove elements (and their content) by class.
    for tag in tree.css('div.warning, div.hidden'):
        tag.decompose()
    # Remove divs that are hidden with an inline style.
    for tag in tree.css('div[style]'):
        style_value = tag.attributes['style']
        if style_value and _display_none_regex.search(style_value):
            tag.decompose()
    text = tree.body.text()
This actually works. When I now run the same benchmark on the 10,000 snippets, these are the new results:
pyquery
  SUM:    21.70 seconds
  MEAN:   2.1701 ms
  MEDIAN: 1.3989 ms
selectolax
  SUM:    3.59 seconds
  MEAN:   0.3589 ms
  MEDIAN: 0.2184 ms
regex
  Skip
selectolax still beats PyQuery by a factor of ~6.
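The factor can be checked against the SUM figures reported in this post. This is just the arithmetic, with the numbers copied by hand into a dictionary I named `totals`:

```python
# SUM values in seconds, copied from the two benchmark runs in this post.
totals = {
    'plain stripping': {'pyquery': 18.61, 'selectolax': 3.08, 'regex': 1.64},
    'with tag removal': {'pyquery': 21.70, 'selectolax': 3.59},
}

ratios = {}
for label, t in totals.items():
    # How many times longer PyQuery took than selectolax in this run.
    ratios[label] = t['pyquery'] / t['selectolax']
    print('%s: pyquery/selectolax = %.1fx' % (label, ratios[label]))

# How many times longer selectolax took than the regex in the first run.
regex_ratio = totals['plain stripping']['selectolax'] / totals['plain stripping']['regex']
print('plain stripping: selectolax/regex = %.1fx' % regex_ratio)
```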
Regular expressions are fast but weak in power. Makes sense.