Using MD5 to check equality between files

28 October 2005   11 comments   Python

To some Python users this is old-school old-news stuff but since I've never used it before I found it worth mentioning.

I have a script that scans a rather large tree of folders filled with files. None of the folders have the same name, but they can mistakenly contain the same files, e.g.:

folder XYZ-2005-11-27/
folder CBA-2005-07-10/

Sometimes two different folders contain exactly the same file names. Sometimes the file sizes are equal too. But in some of those cases, even though the file sizes and names are the same, they are different files. But! If they are the same files just in different locations, I want to find them. How to do that?

The trick is to use the md5 module in Python, like this:

import md5, os
f1 = file(os.path.join(path_1, os.listdir(path_1)[0]), 'rb')
f2 = file(os.path.join(path_2, os.listdir(path_2)[0]), 'rb')
print md5.new(f1.read()).digest() == md5.new(f2.read()).digest()

UPDATE: As "cableguy" pointed out, the files should be opened in binary mode.
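For readers on a current Python, here is a minimal sketch of the same idea applied to a whole folder tree. It assumes hashlib.md5 as the modern replacement for the old md5 module, and the names file_md5 and find_duplicates are made up for this illustration. Files are grouped by size first, since a file with a unique size cannot have a duplicate, and only then by digest.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path):
    """Return the MD5 digest of a file's bytes (opened in binary mode)."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).digest()


def find_duplicates(root):
    """Group files under root by size, then by MD5 digest.

    Returns a list of groups; each group is a sorted list of paths
    whose contents produced the same digest.
    """
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_md5(path)].append(path)
        duplicates.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return duplicates
```

Calling find_duplicates on the top folder returns the groups of paths that look identical. Like the snippet above, file_md5 reads each file fully into memory, which is fine for small files only.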


md5 is rather slow for this purpose. It also seems to me that, to simply get a checksum over a file, deploying a hash algorithm worthy of being a component of sophisticated encryption is rather overkill.

You might be interested in zlib.adler32 and zlib.crc32 (a bit slower, but slightly fewer collisions).
Peter Bengtsson
Slow? On this PC it takes about 0.0027 seconds to get the checksum of a 350 KB file.

But, on that note, it takes 0.0009 seconds on average with zlib.adler32()

I wrote a little benchmark script and got these results:

A 0.00166934132576
B 0.00266071277506
C 0.000866203977351
D 0.00112253580338


import md5, zlib

def A(payload):
    return hash(payload)

def B(payload):
    # md5, matching the 0.0027 second figure quoted above
    return md5.new(payload).digest()

def C(payload):
    return zlib.adler32(payload)

def D(payload):
    return zlib.crc32(payload)
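The timing harness itself is not shown, so the following is only a sketch of how numbers like these could be gathered on a current Python. It assumes hashlib.md5 in place of the md5 module, random bytes standing in for the 350 KB file, and a made-up helper called average_seconds. A fresh payload is generated for every run because CPython caches the hash of a bytes object, which would otherwise make A look almost free after the first call.

```python
import hashlib
import os
import time
import zlib


def A(payload):
    return hash(payload)


def B(payload):
    return hashlib.md5(payload).digest()


def C(payload):
    return zlib.adler32(payload)


def D(payload):
    return zlib.crc32(payload)


def average_seconds(func, size=350 * 1024, runs=20):
    """Average wall-clock time of func over `runs` fresh random payloads."""
    total = 0.0
    for _ in range(runs):
        payload = os.urandom(size)  # new object each run, so no cached hash
        start = time.perf_counter()
        func(payload)
        total += time.perf_counter() - start
    return total / runs


if __name__ == "__main__":
    for func in (A, B, C, D):
        print(func.__name__, average_seconds(func))
```

The absolute figures depend heavily on the machine and the Python version, so only the relative ordering is worth comparing.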

Thanks for the pointers Florian.
If you use CRC32 then you can also include the contents of zip files by using the CRC value stored in infolist() instead of having to read the file from the zip and computing the CRC.
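A short sketch of that idea, using the standard zipfile module: every ZipInfo entry returned by infolist() carries a CRC attribute holding the CRC-32 of the uncompressed member. The helper names zip_member_crcs and data_crc32 are made up for this example.

```python
import io
import zipfile
import zlib


def zip_member_crcs(zip_source):
    """Map member name -> CRC-32 stored in the archive directory.

    Nothing is decompressed; the values come from the zip metadata.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return {info.filename: info.CRC for info in archive.infolist()}


def data_crc32(data):
    """CRC-32 of raw bytes, masked to an unsigned 32-bit value."""
    return zlib.crc32(data) & 0xFFFFFFFF


if __name__ == "__main__":
    # Build a small archive in memory and compare it against loose bytes.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("report.txt", b"hello world")
    print(zip_member_crcs(buffer)["report.txt"] == data_crc32(b"hello world"))
```

A matching CRC-32 is only a strong hint that two files are the same, since the checksum is 32 bits wide.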
You should open the file in binary reading mode. Use file(name, 'rb').
Myers Carpenter
Just use filecmp.cmp().
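For a one-off comparison of two known paths, that could look like the sketch below. The wrapper name same_content is made up; the relevant detail is shallow=False, which makes filecmp.cmp compare the file contents instead of trusting matching os.stat() signatures.

```python
import filecmp


def same_content(path_a, path_b):
    """Return True if both files hold identical bytes."""
    # shallow=False compares contents rather than only size and mtime.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

This avoids hashing altogether, but it compares one pair at a time, so for many candidate files a digest per file may still be the more practical index.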
Check into the fchksum module: there is definitely a difference in speed, plus it doesn't require buffering the entire contents of the file in memory.
Could you please post a link to where a hapless victim can download ready-to-install packages for Python 2.3/2.4 for Mac OS X, Linux and windoze?

A performance comparison would be nice too (including md5, adler32, crc32 and fchksum).
Of course I do mean fchksum; it's part of neither the standard Python distribution nor ActiveState's.
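The point about not buffering the whole file in memory does not depend on fchksum. As a sketch using only the standard library (this is not fchksum's API, and the function names and the 64 KB chunk size are arbitrary choices for illustration), both hashlib and zlib can be fed a file piece by piece:

```python
import hashlib
import zlib

CHUNK_SIZE = 64 * 1024  # arbitrary buffer size chosen for this sketch


def md5_of_file(path, chunk_size=CHUNK_SIZE):
    """MD5 hex digest of a file, read one chunk at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def crc32_of_file(path, chunk_size=CHUNK_SIZE):
    """CRC-32 of a file, carried forward as a running value per chunk."""
    value = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.crc32(chunk, value)
    return value & 0xFFFFFFFF
```

Memory use stays at roughly one chunk regardless of file size, and the results are the same as hashing the whole content in one call.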
Peter Bengtsson
Thanks for the advice but I can't afford the time to test this more. The next time I write a benchmark I'll include this.