tl;dr; By a slim margin, the fastest way to check whether a filename matches any of a list of extensions is filename.endswith(extensions_tuple), with the extensions passed as a tuple.

This turned out to be premature optimization. The context is that I want to check whether a filename ends with any one of a list of 6 file extensions.

The list is ['.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2']. That means it should return True for foo.sym or foo.dbg.gz, but False for bar.exe or bar.gz.

I put together a little benchmark, ran it a bunch of times and looked at the results. Here are the functions I wrote:


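import re

extensions = ['.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2']
# str.endswith() accepts a tuple of suffixes, but not a list.
extensions_tuple = tuple(extensions)


def f1(filename):
    # Plain loop: test one extension at a time.
    for each in extensions:
        if filename.endswith(each):
            return True
    return False


def f2(filename):
    # One call: endswith() checks every suffix in the tuple.
    return filename.endswith(extensions_tuple)


# A single pattern that matches any of the extensions at the end of the string.
regex = re.compile(r'({})$'.format(
    '|'.join(re.escape(x) for x in extensions)
))


def f3(filename):
    # findall() returns a list of matches; an empty list is falsy.
    return bool(regex.findall(filename))


def f4(filename):
    # search() returns a match object or None.
    return bool(regex.search(filename))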

The results are boring. But I guess that's a result too:

FUNCTION             MEDIAN               MEAN
f1 9543 times        0.0110ms             0.0116ms
f2 9523 times        0.0031ms             0.0034ms
f3 9560 times        0.0041ms             0.0045ms
f4 9509 times        0.0041ms             0.0043ms

For a list of ~40,000 realistic filenames (with result True 75% of the time), I ran each function 10 times, so each timing above covers 10 calls. In other words, it took on average 0.0116ms to run f1 10 times, here on my laptop with Python 3.6.
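For the curious, here is a minimal sketch of what such a benchmark harness could look like. It is not the exact script behind the numbers above: the random assignment of each filename to one function, the `benchmark` helper and the tiny sample list are all assumptions made for illustration, and only f2 and f4 are included to keep it short.

```python
import random
import re
import statistics
import time

extensions = ['.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2']
extensions_tuple = tuple(extensions)
regex = re.compile(r'({})$'.format(
    '|'.join(re.escape(x) for x in extensions)
))


def f2(filename):
    return filename.endswith(extensions_tuple)


def f4(filename):
    return bool(regex.search(filename))


def benchmark(functions, filenames, repeats=10):
    """Hypothetical helper: time `repeats` calls per filename, in ms."""
    timings = {func.__name__: [] for func in functions}
    for filename in filenames:
        # Assumption: each filename goes to one randomly chosen function.
        func = random.choice(functions)
        t0 = time.perf_counter()
        for _ in range(repeats):
            func(filename)
        t1 = time.perf_counter()
        timings[func.__name__].append((t1 - t0) * 1000)
    return timings


# Made-up sample data; the real run used ~40,000 realistic filenames.
filenames = ['foo.sym', 'foo.dbg.gz', 'bar.exe', 'bar.gz'] * 1000
results = benchmark([f2, f4], filenames)
for name, times in sorted(results.items()):
    print('{} {} times  median {:.4f}ms  mean {:.4f}ms'.format(
        name, len(times), statistics.median(times), statistics.mean(times)
    ))
```

The absolute numbers will differ from machine to machine, so only the relative ordering is worth reading into.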

More premature optimization

After looking into the data and thinking about how this will be used, I tried reordering the list of extensions so that the most common one is first, the second most common second, and so on. That improves the performance a bit for f1 but slows f3 and f4 down slightly.
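A rough sketch of how that reordering could be done, assuming you have a representative sample of filenames to count from. The `sample` list here is made up purely for illustration.

```python
from collections import Counter

extensions = ['.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2']

# Hypothetical sample; in practice this would be the real filenames.
sample = ['a.sym', 'b.sym', 'c.sym', 'd.dbg.gz', 'e.dbg.gz', 'f.pd_', 'g.exe']

# Count how often each extension is the one that matches.
counts = Counter()
for filename in sample:
    for ext in extensions:
        if filename.endswith(ext):
            counts[ext] += 1
            break

# Most common first. sorted() is stable, so extensions that never
# matched keep their original relative order at the end.
ordered = sorted(extensions, key=lambda ext: -counts[ext])
print(ordered)
```

With the loop in f1, a match on the first extension returns immediately, which is why putting the most common extension first helps there.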

Conclusion

That .endswith(some_tuple) is neat and it's hair-splittingly faster. But really, this turned out to not make a huge difference in the grand scheme of things. On average it takes around 0.001ms or less to do one filename match.

Comments

Eric Werner

Whow nice! I didn't even know that `.startswith()/.endswith()` eat tuples!! 👍 Thanks!

But you didn't consider using `os.path.splitext()`? And then compare if in list?
What about lowercasing it before? To match accidentally upper cased extensions?

Peter Bengtsson

os.path.splitext will say the extension is .gz for both foo.tar.gz and foo.gz and I needed it to be more specific.
Lowercasing would be the same across the board.

Yeah, that tuple trick on endswith is nice.
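
To illustrate both points, here is a small sketch. The `matches` helper is a hypothetical name, and lowercasing is shown as an optional extra step, not something the benchmark above did.

```python
import os

extensions_tuple = ('.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2')

# splitext() only ever returns the last suffix, so these two
# filenames are indistinguishable by extension alone.
print(os.path.splitext('foo.dbg.gz'))  # ('foo.dbg', '.gz')
print(os.path.splitext('bar.gz'))      # ('bar', '.gz')


def matches(filename):
    # Hypothetical helper: lowercase first so FOO.SYM also matches.
    return filename.lower().endswith(extensions_tuple)


print(matches('foo.dbg.gz'))  # True
print(matches('bar.gz'))      # False
print(matches('FOO.SYM'))     # True
```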

Dmitry Danilov

It helped me solve the problem! It also takes less code than I expected. Thanks!

Kradak Thomas

Great solution. An extended problem seeks to process files ending in .xlsx, .xlsm, .xltm, .xltx with my list value having items ('xls', 'xlt') or even (.xl). My thoughts are do it in two steps: (1) you use .endswith for the simple hits, then (2) take a pass on my problem set, whatever the solution is.
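
One possible way to handle that extended problem, sketched under the assumption that the interesting part is always the last suffix: take the extension with os.path.splitext and then use .startswith() with a tuple of prefixes. The `is_excel_like` name and the prefix tuple are made up for this example.

```python
import os

# Hypothetical prefixes, written with the leading dot that splitext() returns.
excel_prefixes = ('.xls', '.xlt')


def is_excel_like(filename):
    # splitext() returns (root, ext); ext is '' when there is no suffix.
    ext = os.path.splitext(filename)[1].lower()
    return ext.startswith(excel_prefixes)


print(is_excel_like('report.xlsx'))    # True
print(is_excel_like('macro.XLSM'))     # True
print(is_excel_like('template.xltx'))  # True
print(is_excel_like('notes.txt'))      # False
```

Note that a prefix check is deliberately loose: it would also accept any other extension that happens to start with .xls or .xlt, so an explicit tuple of full extensions passed to .endswith() is the stricter option.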
