Comment

Gregory J. Baker

I beg to disagree with your conclusion: "Regular expressions are ... weak in power"

import html
import re

html_text = '''<h4 class="warning">This should get stripped.</h4>
<p>Please keep.</p>
<p>Foo & Bar</p>
<div style="display: none">This should also get stripped.</div>'''

warnings = '<.*(warning){1}.*>'
no_display = '<.*(display: none){1}.*>'
disp_warn = re.compile(rf'{no_display}|{warnings}')
html_tags = re.compile(r'<.*?>')

clean_txt = html.unescape(html_tags.sub('', disp_warn.sub('', html_text)))

This gives you:

Please keep.
Foo & Bar

Replies

Peter Bengtsson

But would you use it in a production application where the HTML isn't perfectly pure?
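
To make that concern concrete, here is a minimal sketch. It assumes the same patterns as in the comment above; the `clean` helper and the two inputs (`one_line`, `multi_line`) are hypothetical, made up for illustration. It shows two ways the approach can go wrong once the markup is not laid out one element per line.

```python
import html
import re

# Same patterns as in the comment above.
warnings = '<.*(warning){1}.*>'
no_display = '<.*(display: none){1}.*>'
disp_warn = re.compile(rf'{no_display}|{warnings}')
html_tags = re.compile(r'<.*?>')


def clean(html_text):
    """Hypothetical helper: apply both substitutions, then unescape entities."""
    return html.unescape(html_tags.sub('', disp_warn.sub('', html_text)))


# Hypothetical input 1: the same kind of markup, minified onto one line.
one_line = (
    '<h4 class="warning">This should get stripped.</h4>'
    '<p>Please keep.</p>'
    '<p>Foo &amp; Bar</p>'
    '<div style="display: none">This should also get stripped.</div>'
)

# The greedy .* runs from the first "<" to the last ">" on the line,
# so the paragraphs that should survive are removed as well.
print(repr(clean(one_line)))  # ''

# Hypothetical input 2: a hidden element whose content spans several lines.
multi_line = '<div style="display: none">\nSecret.\n</div>\n<p>Please keep.</p>'

# "." does not match a newline, so only the opening tag is removed
# and the hidden text leaks into the result.
print('Secret.' in clean(multi_line))  # True
```

Both failures come from the patterns depending on where the line breaks fall, which is not something real-world HTML guarantees.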