Fastest way to match a filename's extension in Python

31 August 2017   3 comments   Python

tl;dr; By a slim margin, the fastest way to check a filename matching a list of extensions is filename.endswith(extensions)

This turned out to be premature optimization. The context is that I want to check if a filename matches the file extension in a list of 6.

The list being ['.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2']. Meaning, it should return True for foo.sym or foo.dbg.gz. But it should return False for bar.exe or bar.gz.

I put together a litte benchmark, ran it a bunch of times and looked at the results. Here are the functions I wrote:

def f1(filename):
    for each in extensions:
        if filename.endswith(each):
            return True
    return False


def f2(filename):
    return filename.endswith(extensions_tuple)


regex = re.compile(r'({})$'.format(
    '|'.join(re.escape(x) for x in extensions)
))


def f3(filename):
    return bool(regex.findall(filename))


def f4(filename):
    return bool(regex.search(filename))

The results are boring. But I guess that's a result too:

FUNCTION             MEDIAN               MEAN
f1 9543 times        0.0110ms             0.0116ms
f2 9523 times        0.0031ms             0.0034ms
f3 9560 times        0.0041ms             0.0045ms
f4 9509 times        0.0041ms             0.0043ms

For a list of ~40,000 realistic filenames (with result True 75% of the time), I ran each function 10 times. So, it means it took on average 0.0116ms to run f1 10 times here on my laptop with Python 3.6.

More premature optimization

Upon looking into the data and thinking about this will be used. If I reorder the list of extensions so the most common one is first, second most common second etc. Then the performance improves a bit for f1 but slows down slightly for f3 and f4.

Conclusion

That .endswith(some_tuple) is neat and it's hair-splittingly faster. But really, this turned out to not make a huge difference in the grand scheme of things. On average it takes less than 0.001ms to do one filename match.

Comments

Eric Werner

Whow nice! I didn't even know that `.startswith()/.endswith()` eat tuples!! 👍 Thanks!

But you didn't consider using `os.path.splitext()`? And then compare if in list?
What about lowercasing it before? To match accidentally upper cased extensions?

Peter Bengtsson

os.path.splitext will say the extension is .gz for both foo.tar.gz and foo.gz and I needed it to be more specific.
Lowercasing would be the same across the board.

Yeah, that tuple trick on endswith is nice.

Dmitry Danilov

It helped me to solve problem! It also takes less code that I expected. Thanks!

Your email will never ever be published


Related posts

Previous:
React lifecycle hooks must-have 13 August 2017
Next:
Ultrafast loading of CSS 01 September 2017
Related by Keyword:
A darn good search filter function in JavaScript 12 September 2018
Don't forget your sets in Python! 10 March 2017
Optimization of QuerySet.get() with or without select_related 03 November 2016
How to no-mincss links with django-pipeline 03 February 2016
mozjpeg installation and sample 10 October 2015
Related by Text:
jQuery and Highslide JS 08 January 2008
I'm back! Peterbe.com has been renewed 05 June 2005
Anti-McCain propaganda videos 12 August 2008
Ever wondered how much $87 Billion is? 04 November 2003
Guake, not Yakuake or Yeahconsole 23 January 2010