Fancy linkifying of text with Bleach and domain checks (with Python)

10 October 2018   2 comments   Python, Web development

Bleach is awesome. Thank you for it @willkg! It's a Python library for sanitizing text as well as "linkifying" text for HTML use. For example, consider this:

>>> import bleach
>>> bleach.linkify("Here is some text with a url.com.")
'Here is some text with a <a href="http://url.com" rel="nofollow">url.com</a>.'

Note that sanitizing is separate thing, but if you're curious, consider this example:

>>> bleach.linkify(bleach.clean("Here is <script> some text with a url.com."))
'Here is &lt;script&gt; some text with a <a href="http://url.com" rel="nofollow">url.com</a>.'

With that output you can confidently template interpolate that string straight into your HTML.

Getting fancy

That's a great start but I wanted a more. For one, I don't always want the rel="nofollow" attribute on all links. In particular for links that are within the site. Secondly, a lot of things look like a domain but isn't. For example This is a text.at the start which would naively become...:

>>> bleach.linkify("This is a text.at the start")
'This is a <a href="http://text.at" rel="nofollow">text.at</a> the start'

...because text.at looks like a domain.

So here is how I use it here on www.peterbe.com to linkify blog comments:

def custom_nofollow_maker(attrs, new=False):
    href_key = (None, u"href")

    if href_key not in attrs:
        return attrs

    if attrs[href_key].startswith(u"mailto:"):
        return attrs

    p = urlparse(attrs[href_key])
    if p.netloc not in settings.NOFOLLOW_EXCEPTIONS:
        # Before we add the `rel="nofollow"` let's first check that this is a
        # valid domain at all.
        root_url = p.scheme + "://" + p.netloc
        try:
            response = requests.head(root_url)
            if response.status_code == 301:
                redirect_p = urlparse(response.headers["location"])
                # If the only difference is that it redirects to https instead
                # of http, then amend the href.
                if (
                    redirect_p.scheme == "https"
                    and p.scheme == "http"
                    and p.netloc == redirect_p.netloc
                ):
                    attrs[href_key] = attrs[href_key].replace("http://", "https://")

        except ConnectionError:
            return None

        rel_key = (None, u"rel")
        rel_values = [val for val in attrs.get(rel_key, "").split(" ") if val]
        if "nofollow" not in [rel_val.lower() for rel_val in rel_values]:
            rel_values.append("nofollow")
        attrs[rel_key] = " ".join(rel_values)

    return attrs

html = bleach.linkify(text, callbacks=[custom_nofollow_maker])

This basically taking the default nofollow callback and extending it a bit.

By the way, here is the complete code I use for sanitizing and linkifying blog comments here on this site: render_comment_text.

Caveats

This is slow because it requires network IO every time a piece of text needs to be linkified (if it has domain looking things in it) but that's best alleviated by only doing it once and either caching it or persistently storing the cleaned and rendered output.

Also, the check uses try: requests.head() except requests.exceptions.ConnectionError: as the method to see if the domain works. I considered doing a whois lookup or something but that felt a little wrong because just because a domain exists doesn't mean there's a website there. Either way, it could be that the domain/URL is perfectly fine but in that very unlucky instant you checked your own server's internet or some other DNS lookup thing is busted. Perhaps wrapping it in a retry and doing try: requests.head() except requests.exceptions.RetryError: instead.

Lastly, the business logic I chose was to rewrite all http:// to https:// only if the URL http://domain does a 301 redirect to https://domain. So if the original link was http://bit.ly/redirect-slug it leaves it as is. Perhaps a fancier version would be to look at the domain name ending. For example HEAD http://google.com 301 redirects to https://www.google.com so you could use the fact that "www.google.com".endswith("google.com").

UPDATE Oct 10 2018

Moments after publishing this, I discovered a bug where it would fail badly if the text contained a URL with an ampersand in it. Turns out, it was a known bug in Bleach. It only happens when you try to pass a filter to the bleach.Cleaner() class.

So I simplified my code and now things work. Apparently, using bleach.Cleaner(filters=[...]) is faster so I'm losing that. But, for now, that's OK in my context.

Also, in another later fix, I improved the function some more by avoiding non-HTTP links (with the exception of mailto: and tel:). Otherwise it would attempt to run requests.head('ssh://server.example.com') which doesn't make sense.

Comments

René Dudfield
Very nice.

I've been working on something like this recently, and experiencing similar problems.

Have you considered checking the url for spaminess and malware somehow? Especially after manually reviewing lots of submitted urls, and when old urls get thier domains taken over by other dodgey websites. There's a few services around for this, but I'm not sure which is good. The obvious one is the Google Safe Browsing service, which firefox used (or used to?) https://developers.google.com/safe-browsing/

Speaking of domains getting taken over, or going down... having a link to archive.org is kind of nice. I haven't implemented this, but an easy way might be to put an archive icon link next to every link. Not sure how to check if 'the url is sort of what it should be', because content and site design can change.

Expansion of shortened urls, and https checking would be good (there's still lots of sites with broken https). The link shorteners are often trackers. But at least now many of the mappings are tracked by 301works: https://archive.org/details/301works&tab=about So now when link shorteners stop working, or are changed, it should be possible to find out where things went to before it changed.

For comments, I've trained a spam classifier with a bunch of blog comments collected over the years. Additionally, putting limits on how many comments can be posted per hour stops bots from hammering the commenting endpoints has helped a lot. Finally, having moderation tools for site admins to mark comments as spam/not spam has helped.
Peter Bengtsson
In short, the world is your oyster when you have a tool like this. But one thing you can definitely do is train a spam classifier separately just on the URLs. That way, if the spammers use "good words" around spamming URLs you can catch them based on the URLs.

On this my own blog I manually moderate all comments. It sucks but it's relatively quick. The blue links stand out and alert me to take a closer look.

Your email will never ever be published


Related posts

Previous:
The ideal number of workers in Jest 08 October 2018
Next:
Switching from AWS S3 (boto3) to Google Cloud Storage (google-cloud-storage) in Python 12 October 2018
Related by Keyword:
Best practice with retries with requests 19 April 2017
Fastest way to download a file from S3 29 March 2017
Cope with JSONDecodeError in requests.get().json() in Python 2 and 3 16 November 2016
How to track Google Analytics pageviews on non-web requests (with Python) 03 May 2016
JetBlue a good and bad website 22 February 2007
Related by Text:
jQuery and Highslide JS 08 January 2008
I'm back! Peterbe.com has been renewed 05 June 2005
Anti-McCain propaganda videos 12 August 2008
Ever wondered how much $87 Billion is? 04 November 2003
Guake, not Yakuake or Yeahconsole 23 January 2010