Beg to disagree with your conclusion: "Regular expressions are ... weak in power"
html_text = '''<h4 class="warning">This should get stripped.</h4> <p>Please keep.</p> <p>Foo & Bar</p> <div style="display: none">This should also get stripped.</div>'''
Comment
Beg to disagree with your conclusion: "Regular expressions are ... weak in power"
html_text = '''<h4 class="warning">This should get stripped.</h4>
<p>Please keep.</p>
<p>Foo & Bar</p>
<div style="display: none">This should also get stripped.</div>'''
warnings = '<.*(warning){1}.*>'
no_display = '<.*(display: none){1}.*>'
disp_warn = re.compile(rf('{no_display}|{warnings}')
html_tags = re.compile(r'<.*?>')
clean_txt = html.unescape(html_tags.sub('', disp_warn.sub('', html_text)))
Will give you:
Please keep.
Foo & Bar
Replies
But would you use it in a production application where the HTML isn't perfectly pure?