>>> import bleach
>>> bleach.linkify("Here is some text with a url.com.")
'Here is some text with a <a href="http://url.com" rel="nofollow">url.com</a>.'
Note that sanitizing is a separate thing, but if you're curious, consider this example:
>>> bleach.linkify(bleach.clean("Here is <script> some text with a url.com."))
'Here is &lt;script&gt; some text with a <a href="http://url.com" rel="nofollow">url.com</a>.'
With that output you can confidently interpolate the string straight into your HTML template.
That's a great start, but I wanted more. For one, I don't always want the rel="nofollow" attribute on all links, in particular not on links that are within the site. Secondly, a lot of things look like a domain but aren't. For example, "This is a text.at the start" would naively become:
>>> bleach.linkify("This is a text.at the start")
'This is a <a href="http://text.at" rel="nofollow">text.at</a> the start'
text.at looks like a domain, but here it isn't one.
So here is how I use it here on www.peterbe.com to linkify blog comments:
from urllib.parse import urlparse

import bleach
import requests
from requests.exceptions import ConnectionError

# `settings.NOFOLLOW_EXCEPTIONS` is a site-specific collection of domains
# that should not get rel="nofollow" (e.g. the site's own domain).


def custom_nofollow_maker(attrs, new=False):
    href_key = (None, u"href")

    if href_key not in attrs:
        return attrs

    if attrs[href_key].startswith(u"mailto:"):
        return attrs

    p = urlparse(attrs[href_key])
    if p.netloc not in settings.NOFOLLOW_EXCEPTIONS:
        # Before we add the `rel="nofollow"` let's first check that this is a
        # valid domain at all.
        root_url = p.scheme + "://" + p.netloc
        try:
            response = requests.head(root_url)
            if response.status_code == 301:
                redirect_p = urlparse(response.headers["location"])
                # If the only difference is that it redirects to https instead
                # of http, then amend the href.
                if (
                    redirect_p.scheme == "https"
                    and p.scheme == "http"
                    and p.netloc == redirect_p.netloc
                ):
                    attrs[href_key] = attrs[href_key].replace("http://", "https://")

        except ConnectionError:
            # Returning None tells bleach not to turn this text into a link.
            return None

        rel_key = (None, u"rel")
        rel_values = [val for val in attrs.get(rel_key, "").split(" ") if val]
        if "nofollow" not in [rel_val.lower() for rel_val in rel_values]:
            rel_values.append("nofollow")
        attrs[rel_key] = " ".join(rel_values)

    return attrs


html = bleach.linkify(text, callbacks=[custom_nofollow_maker])
This is basically taking the default nofollow callback and extending it a bit.
By the way, here is the complete code I use for sanitizing and linkifying blog comments here on this site:
This is slow because it requires network I/O every time a piece of text needs to be linkified (if it has domain-looking things in it). That's best alleviated by doing it only once and either caching the result or persistently storing the cleaned and rendered output.
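To illustrate the "only do it once" idea, here is a minimal sketch that caches rendered output keyed by a hash of the input text. The names `make_cached_renderer`, `slow_render` and `render_comment` are made up for this example, and the in-memory dict is only a stand-in for whatever cache or database column you would really use.

```python
import hashlib


def make_cached_renderer(render):
    """Wrap `render` so each distinct text is only rendered once."""
    cache = {}  # in-memory stand-in for a real cache or a stored column

    def cached_render(text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = render(text)
        return cache[key]

    return cached_render


# Hypothetical stand-in for the slow clean-and-linkify call.
calls = []


def slow_render(text):
    calls.append(text)
    return text.upper()


render_comment = make_cached_renderer(slow_render)
render_comment("hello url.com")
render_comment("hello url.com")  # served from the cache, no second render
```

In a real setup `render` would be the function that calls bleach, and the expensive network checks would then only happen the first time a given comment is rendered.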
Also, the check uses try: requests.head() except requests.exceptions.ConnectionError: as the method to see if the domain works. I considered doing a whois lookup or something, but that felt a little wrong: just because a domain exists doesn't mean there's a website there. Either way, the domain/URL could be perfectly fine, but in that very unlucky instant when you checked, your own server's internet connection or some DNS lookup was busted. Perhaps a better approach is to wrap it in a retry and use try: requests.head() except requests.exceptions.RetryError: instead.
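As a rough sketch of the retry idea, here is a hand-rolled retry loop. Note that this is not the library-level retry that RetryError hints at; it is a generic wrapper, and `call_with_retries`, `flaky_head` and the parameter names are all invented for this example.

```python
import time


def call_with_retries(check, attempts=3, delay=0.5, exceptions=(ConnectionError,)):
    """Call `check()` up to `attempts` times before giving up.

    Returns whatever `check()` returns. If every attempt raises one of
    `exceptions`, the last exception is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return check()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)  # brief pause before trying again


# Hypothetical flaky check: fails twice, then succeeds.
outcomes = iter([ConnectionError("dns"), ConnectionError("dns"), 200])


def flaky_head():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


status = call_with_retries(flaky_head, attempts=3, delay=0)
```

With requests you would pass something like `lambda: requests.head(root_url)` as `check` and `exceptions=(requests.exceptions.ConnectionError,)`, since that exception class is not a subclass of the built-in ConnectionError used as the default here.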
Lastly, the business logic I chose was to rewrite http:// to https:// only if the URL http://domain does a 301 redirect to https://domain. So if the original link was http://bit.ly/redirect-slug it leaves it as is. Perhaps a fancier version would be to look at the domain name ending. For example, HEAD http://google.com 301 redirects to https://www.google.com so you could use the fact that
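The rule above, plus one possible reading of "look at the domain name ending", can be sketched as a pure function. The name `https_upgrade` and the `allow_subdomain` flag are assumptions made for this example. The looser variant also assumes that the bare domain itself answers on https, which the redirect alone doesn't prove.

```python
from urllib.parse import urlparse


def https_upgrade(href, status_code, location, allow_subdomain=False):
    """Return `href` rewritten to https if the redirect justifies it."""
    if status_code != 301 or not location:
        return href
    original = urlparse(href)
    target = urlparse(location)
    if original.scheme != "http" or target.scheme != "https":
        return href
    same_host = target.netloc == original.netloc
    # Looser variant (an assumption): accept www.google.com for google.com.
    subdomain = allow_subdomain and target.netloc.endswith("." + original.netloc)
    if same_host or subdomain:
        return href.replace("http://", "https://", 1)
    return href


strict = https_upgrade("http://example.com/a", 301, "https://example.com/")
untouched = https_upgrade("http://bit.ly/redirect-slug", 301, "https://example.com/page")
loose = https_upgrade("http://google.com", 301, "https://www.google.com/", allow_subdomain=True)
```

Matching on "." plus the original domain, rather than a plain suffix, avoids treating something like evilgoogle.com as if it belonged to google.com.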
UPDATE Oct 10 2018
Moments after publishing this, I discovered a bug where it would fail badly if the text contained a URL with an ampersand in it. Turns out, it was a known bug in Bleach. It only happens when you try to pass a filter to the
So I simplified my code and now things work. Apparently, using bleach.Cleaner(filters=[...]) is faster, so I'm losing that speed-up. But, for now, that's OK in my context.
Also, in a later fix, I improved the function some more by avoiding non-HTTP links (with the exception of tel:). Otherwise it would attempt to run requests.head('ssh://server.example.com') which doesn't make sense.
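A minimal sketch of that guard, assuming the decision is made on the URL scheme alone. The helper name `needs_head_check` is made up for this example, and what to do with the links it rejects (such as the tel: exception) is left to the calling callback.

```python
from urllib.parse import urlparse


def needs_head_check(href):
    """True only for links that requests.head() can meaningfully probe."""
    return urlparse(href).scheme in ("http", "https")


checks = {
    href: needs_head_check(href)
    for href in (
        "http://url.com",
        "https://www.peterbe.com",
        "ssh://server.example.com",
        "tel:+1-555-0100",
        "mailto:someone@example.com",
    )
}
```

Only the first two entries come out as True, so the HEAD request would be skipped for the ssh:, tel: and mailto: links.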