Fastest Python function to slugify a string

Thursday, Sep 12, 2019
4 comments Python

In MDN I noticed a function that turns a piece of text (Python 2 unicode) into a slug. It looks like this:


    non_url_safe = ['"', '#', '$', '%', '&', '+',
                    ',', '/', ':', ';', '=', '?',
                    '@', '[', '\\', ']', '^', '`',
                    '{', '|', '}', '~', "'"]

    def slugify(self, text):
        """
        Turn the text content of a header into a slug for use in an ID
        """
        non_safe = [c for c in text if c in self.non_url_safe]
        if non_safe:
            for c in non_safe:
                text = text.replace(c, '')
        # Strip leading, trailing and multiple whitespace, convert remaining whitespace to _
        text = u'_'.join(text.split())
        return text

The code is 7-8 years old and relates to a migration when MDN was created as a Python fork from an existing PHP solution.

I couldn't help but react to the fact that `non_url_safe` is a list and that the text is looped over on every single call. Twice, in a sense: once to collect the unsafe characters and again to replace each one. Python has built-in tools for this kinda stuff. Let's see if I can make it faster.

The candidates


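    import re

    # Built once, not on every call. non_url_safe is the list shown above.
    # The functions reach these names via self, so they live on the class.
    translate_table = {ord(char): u'' for char in non_url_safe}
    non_url_safe_regex = re.compile(
        r'[{}]'.format(''.join(re.escape(x) for x in non_url_safe)))


    # Original: collect the unsafe characters, then replace each one.
    def _slugify1(self, text):
        non_safe = [c for c in text if c in self.non_url_safe]
        if non_safe:
            for c in non_safe:
                text = text.replace(c, '')
        text = u'_'.join(text.split())
        return text

    # Translate table: drop every unsafe character in a single pass.
    def _slugify2(self, text):
        text = text.translate(self.translate_table)
        text = u'_'.join(text.split())
        return text

    # Regex: one precompiled character class, then split on whitespace.
    def _slugify3(self, text):
        text = self.non_url_safe_regex.sub('', text).strip()
        text = u'_'.join(re.split(r'\s+', text))
        return text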

I wrote a thing that would call each one of the candidates, assert that their outputs always match and store how long each one took.
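The harness itself isn't shown in the post, so what follows is only a minimal sketch of what such a thing might look like, not the script that was actually run. It assumes Python 3 and the standard-library `timeit` module, rewrites the three candidates as plain module-level functions (dropping `self`), and uses made-up sample strings. The names `SAMPLES`, `CANDIDATES` and `benchmark` are hypothetical.

```python
import re
import timeit

NON_URL_SAFE = ['"', '#', '$', '%', '&', '+',
                ',', '/', ':', ';', '=', '?',
                '@', '[', '\\', ']', '^', '`',
                '{', '|', '}', '~', "'"]
TRANSLATE_TABLE = {ord(char): '' for char in NON_URL_SAFE}
NON_URL_SAFE_REGEX = re.compile(
    r'[{}]'.format(''.join(re.escape(x) for x in NON_URL_SAFE)))


def slugify1(text):
    # Original approach: find the unsafe characters, then replace each one.
    for c in [c for c in text if c in NON_URL_SAFE]:
        text = text.replace(c, '')
    return '_'.join(text.split())


def slugify2(text):
    # Translate table: remove all unsafe characters in one pass.
    return '_'.join(text.translate(TRANSLATE_TABLE).split())


def slugify3(text):
    # Precompiled regex character class, then split on whitespace runs.
    text = NON_URL_SAFE_REGEX.sub('', text).strip()
    return '_'.join(re.split(r'\s+', text))


CANDIDATES = [slugify1, slugify2, slugify3]

# Made-up inputs for illustration; real header text would be better.
SAMPLES = [
    "  Hello, World!  ",
    "What's new in ES6?",
    "a/b\\c [d] {e}",
    "Multiple   spaces\tand\nnewlines",
    "???",
    "",
]


def benchmark(samples=SAMPLES, number=1000):
    # First make sure every candidate agrees on every sample.
    for text in samples:
        outputs = {f(text) for f in CANDIDATES}
        assert len(outputs) == 1, (text, outputs)
    # Then time each candidate; the result is mean milliseconds per call
    # (including the small overhead of the lambda and list comprehension).
    results = {}
    for f in CANDIDATES:
        total = timeit.timeit(lambda: [f(t) for t in samples], number=number)
        results[f.__name__] = total / (number * len(samples)) * 1000
    return results


if __name__ == '__main__':
    for name, ms in benchmark().items():
        print('{} {:.4f}ms'.format(name, ms))
```

Absolute numbers from a sketch like this will depend on the machine, the Python version and the sample strings, so they shouldn't be expected to line up with the figures in this post.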

The results

The slowest is fast enough. But if you're still reading, here are the results:

_slugify1 0.101ms
_slugify2 0.019ms
_slugify3 0.033ms

So using a translate table is about 5 times faster than the original, and the regex is about 3 times faster. But they're all sufficiently fast.

Conclusion

This is the least of your problems in a world of real I/O such as databases, and other genuinely CPU-intensive stuff. Well, it was a fun little side trip.

Also, aren't there better solutions that just blacklist all control characters?
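One way to read that question, purely as a sketch: treat "control characters" as Unicode general category `Cc` and drop those instead of a hand-picked punctuation list. That reading is an assumption, and so is the hypothetical name `slugify_no_controls`. Note that this keeps punctuation such as `?` and `#`, so it is not a drop-in replacement for the function above.

```python
import unicodedata


def slugify_no_controls(text):
    # Collapse whitespace first: tab and newline are themselves control
    # characters, and they should still act as word separators.
    text = '_'.join(text.split())
    # Then drop anything left in Unicode category "Cc" (control characters).
    return ''.join(c for c in text if unicodedata.category(c) != 'Cc')
```

For example, `slugify_no_controls("Hello\x00 World\t!")` gives `"Hello_World_!"`.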

Comments

James Bennett September 13, 2019

I remember this one, and I'm the original author of that piece of code.

When first written, the slow looping approach was actually the simplest solution for the underlying problem, which was the specific way the previous wiki engine had encoded section titles for use in HTML IDs. The old wiki would replace these characters with a sequence of hex values of the character's UTF-8 bytes, each preceded by a dot. So a space in a section title, for example, would become '.20' in the generated ID.

At the time that had to be preserved so that existing links to specific sections of MDN documents would continue to work after the move to Django. You can see the original replacement code in the commit that introduced it:

https://github.com/mozilla/kuma/commit/be10b92234bda15a86f98a893b38fc1dce56e1a9

It would have been possible to write a function that transformed only the characters needing encoding, and map() over the input applying that, but the loop approach, while slightly less efficient, seemed clearer and more readable to me (and the extra time it took was more than lost in the noise, anyway; kuma's page rendering was a hugely expensive operation, for a variety of reasons).

Nowadays, it appears MDN no longer enforces the requirement to remain compatible with MindTouch section IDs, so it'd make sense to me to just go ahead and replace this code with a more idiomatic approach like the translation table (and then another tiny piece of code I wrote would vanish out of MDN...).

Peter Bengtsson October 2, 2019

Thank you for posting that! That MindTouch legacy is still lurking about.

I'm still fond of my conclusion (even though it wasn't particularly surprising) that these little details don't actually matter all that much. I/O rules the latency and creating slugs isn't something that needs to be done every couple of milliseconds. Perhaps I blogged about it just to go for a walk.

Anonymous July 16, 2021

To be fair, the `translate_table` creation should be inside the `_slugify2` function, which is the only one that uses it.

In addition, maybe you should use `timeit` to run them more than once.

upp April 11, 2024

Not really, that's unfair because you'd recreate `translate_table` every time you call `_slugify2`.

