Fastest way to match a filename's extension in Python

Thursday, Aug 31, 2017
4 comments Python

tl;dr; By a slim margin, the fastest way to check whether a filename matches one of several extensions is filename.endswith(extensions), where extensions is a tuple.

This turned out to be premature optimization. The context is that I want to check whether a filename ends with one of 6 file extensions.

The list is ['.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2']. That means it should return True for foo.sym or foo.dbg.gz, but False for bar.exe or bar.gz.
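To make the expected behavior concrete, here is a minimal sketch using str.endswith with a tuple. The helper name matches_extension is made up for this illustration. Note that endswith accepts a string or a tuple of strings, but not a list, so the list has to be converted first.

```python
EXTENSIONS = ['.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2']

# str.endswith raises TypeError when given a list, so convert to a tuple once.
EXTENSIONS_TUPLE = tuple(EXTENSIONS)


def matches_extension(filename):
    # Hypothetical helper: True if the filename ends with any listed extension.
    return filename.endswith(EXTENSIONS_TUPLE)


for name in ('foo.sym', 'foo.dbg.gz', 'bar.exe', 'bar.gz'):
    print(name, matches_extension(name))
```

bar.gz is rejected because only the full '.dbg.gz' suffix is in the list, not '.gz' on its own.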

I put together a little benchmark, ran it a bunch of times and looked at the results. Here are the functions I wrote:


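import re

extensions = ['.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2']
# str.endswith needs a tuple (not a list) to test several suffixes at once
extensions_tuple = tuple(extensions)


def f1(filename):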
    for each in extensions:
        if filename.endswith(each):
            return True
    return False


def f2(filename):
    return filename.endswith(extensions_tuple)


regex = re.compile(r'({})$'.format(
    '|'.join(re.escape(x) for x in extensions)
))


def f3(filename):
    return bool(regex.findall(filename))


def f4(filename):
    return bool(regex.search(filename))

The results are boring. But I guess that's a result too:

FUNCTION             MEDIAN               MEAN
f1 9543 times        0.0110ms             0.0116ms
f2 9523 times        0.0031ms             0.0034ms
f3 9560 times        0.0041ms             0.0045ms
f4 9509 times        0.0041ms             0.0043ms

For a list of ~40,000 realistic filenames (with result True 75% of the time), I ran each function 10 times per filename. In other words, it took on average 0.0116ms to run f1 10 times here on my laptop with Python 3.6.
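For illustration, a measurement like this could be set up roughly as follows. This is a sketch under assumptions, not the exact script used: the benchmark and report helpers, the random choice of function per filename, and the sample filenames are all made up for this example.

```python
import random
import statistics
import time

extensions_tuple = ('.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2')


def f2(filename):
    return filename.endswith(extensions_tuple)


def benchmark(functions, filenames, iterations=10):
    # Maps each function name to a list of elapsed times in milliseconds.
    timings = {func.__name__: [] for func in functions}
    for filename in filenames:
        # Assumption: one function is picked at random per filename,
        # so no function is favored by the order of the data.
        func = random.choice(functions)
        start = time.perf_counter()
        for _ in range(iterations):
            func(filename)
        elapsed = time.perf_counter() - start
        timings[func.__name__].append(elapsed * 1000)
    return timings


def report(timings):
    # Builds one summary line per function that was called at least once.
    lines = []
    for name, times in sorted(timings.items()):
        if not times:
            continue  # this function was never picked
        lines.append('{} {} times  median {:.4f}ms  mean {:.4f}ms'.format(
            name, len(times), statistics.median(times), statistics.mean(times)))
    return lines


# Made-up sample data; the real run used ~40,000 realistic filenames.
sample_filenames = ['foo.sym', 'foo.dbg.gz', 'bar.exe', 'bar.gz'] * 100
for line in report(benchmark([f2], sample_filenames)):
    print(line)
```

Passing several functions, for example benchmark([f1, f2, f3, f4], filenames), would give each of them roughly an equal share of the filenames.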

More premature optimization

Upon looking into the data and thinking about how this will be used, I noticed that if I reorder the list of extensions so the most common one is first, the second most common second, and so on, the performance improves a bit for f1 but slows down slightly for f3 and f4.
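One way to derive that ordering from real data is to count how often each extension occurs and sort by that count. The sketch below assumes a hypothetical helper, order_by_frequency, and a made-up handful of sample filenames.

```python
from collections import Counter

extensions = ['.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2']


def order_by_frequency(extensions, filenames):
    # Hypothetical helper: count which extension each filename ends with.
    counts = Counter()
    for filename in filenames:
        for ext in extensions:
            if filename.endswith(ext):
                counts[ext] += 1
                break
    # sorted() is stable, so extensions with equal counts (including those
    # never seen) keep their original relative order.
    return sorted(extensions, key=lambda ext: -counts[ext])


# Made-up sample data for illustration only.
sample = ['a.pd_', 'b.pd_', 'c.sym', 'd.exe', 'e.pd_', 'f.dbg.gz', 'g.sym']
print(order_by_frequency(extensions, sample))
# → ['.pd_', '.sym', '.dbg.gz', '.dl_', '.ex_', '.tar.bz2']
```

The reordered list only helps approaches that try the extensions one at a time, such as the explicit loop in f1.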

Conclusion

That .endswith(some_tuple) is neat and it's hair-splittingly faster. But really, this turned out to not make a huge difference in the grand scheme of things. On average it takes less than 0.001ms to do one filename match.

Comments

Eric Werner September 4, 2017

Whow nice! I didn't even know that `.startswith()/.endswith()` eat tuples!! 👍 Thanks!

But you didn't consider using `os.path.splitext()`? And then compare if in list?
What about lowercasing it before? To match accidentally upper cased extensions?

Peter Bengtsson September 4, 2017

os.path.splitext will say the extension is .gz for both foo.tar.gz and foo.gz and I needed it to be more specific.
Lowercasing would be the same across the board.

Yeah, that tuple trick on endswith is nice.

Dmitry Danilov October 10, 2018

It helped me to solve my problem! It also takes less code than I expected. Thanks!

Kradak Thomas April 15, 2021

Great solution. An extended problem seeks to process files ending in .xlsx, .xlsm, .xltm, .xltx with my list value having items ('xls', 'xlt') or even (.xl). My thoughts are do it in two steps: (1) you use .endswith for the simple hits, then (2) take a pass on my problem set, whatever the solution is.
