
Using MD5 to check equality between files

md5, tree, digest

28th of October 2005

To some Python users this is old-school old-news stuff but since I've never used it before I found it worth mentioning.

I have a script that scans a rather large tree of folders filled with files. None of the folders have the same name, but they can mistakenly contain the same files, e.g.:

 folder XYZ-2005-11-27/
    email1.bin
    email2.bin
 folder CBA-2005-07-10/
    email1.bin
    email2.bin

Sometimes two different folders contain exactly the same file names. Sometimes the file sizes are equal too. But in some of those cases, even though the names and sizes match, the files themselves are different. What I want to find are the cases where the files really are the same, just in different locations. How to do that?

The trick is to use the md5 module in Python, like this:

 import md5, os

 # Compare the first file in each folder; 'rb' reads the raw bytes
 f1 = file(os.path.join(path_1, os.listdir(path_1)[0]), 'rb')
 f2 = file(os.path.join(path_2, os.listdir(path_2)[0]), 'rb')
 print md5.new(f1.read()).digest() == md5.new(f2.read()).digest()

UPDATE: As "cableguy" pointed out, the files should be opened in binary mode.
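
For readers on a current Python: the md5 module was later replaced by hashlib, and file() by open(). Here is a minimal sketch of the same idea for Python 3, reading each file in chunks so that large files are not loaded into memory in one go. The names file_md5 and find_duplicates, and the 64 KB chunk size, are assumptions made for illustration; they are not part of the original script. Grouping by file size first is an optional shortcut, since files of different sizes cannot be equal.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=64 * 1024):
    """Return the MD5 hex digest of a file, reading it in binary chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by (size, MD5 digest); return groups of 2+."""
    by_size = defaultdict(list)
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            by_digest[(size, file_md5(path))].append(path)

    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Calling find_duplicates on the top folder of the tree would return a list of path groups, each group holding files with the same size and digest.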


Comment

Florian - 28th October 2005
md5 is rather slow for this purpose. It also seems to me that, to simply get a checksum over a file, deploying a hash algorithm worthy of being a component of sophisticated encryption is rather overkill.

You might be interested in zlib.adler32 and zlib.crc32 (a bit slower, but slightly fewer collisions).
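
A rough sketch of what this suggestion could look like in Python 3, assuming the file is fed to the checksum in chunks. The helper name file_checksum and the chunk size are hypothetical. Both zlib functions accept a running value as a second argument, which is what makes the chunked form possible. Keep in mind that these are 32-bit checksums, so two different files are far more likely to share a value than with MD5; a match is a hint, not proof.

```python
import zlib


def file_checksum(path, update=zlib.adler32, chunk_size=64 * 1024):
    """Run a zlib checksum (adler32 or crc32) over a file, chunk by chunk."""
    # Calling the function on empty input yields its starting value.
    value = update(b"")
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            # Pass the running value back in to continue the checksum.
            value = update(chunk, value)
    return value & 0xFFFFFFFF
```

Passing update=zlib.crc32 switches the same helper to CRC32.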

Peter Bengtsson - 28th October 2005
Slow? On this PC it takes about 0.0027 seconds to get the checksum of a 350 KB file.

But, on that note, it takes 0.0009 seconds on average with zlib.adler32().

I wrote a little benchmark script and got these results:

A 0.00166934132576
B 0.00266071277506
C 0.000866203977351
D 0.00112253580338

where...

import md5, zlib

def A(payload):
    return hash(payload)

def B(payload):
    return md5.new(payload).digest()

def C(payload):
    return zlib.adler32(payload)

def D(payload):
    return zlib.crc32(payload)


Thanks for the pointers Florian.
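
For anyone who wants to repeat a comparison like this on a current Python, here is one possible harness using timeit. It is not the script that produced the numbers above: the payload is 350 KB of random bytes standing in for a real file, and the names candidates and benchmark are made up for this sketch. hash() is left out because CPython caches the hash of a bytes object after the first call, which would distort repeated timings.

```python
import hashlib
import os
import timeit
import zlib

# Hypothetical payload: 350 KB of random bytes standing in for a real file.
payload = os.urandom(350 * 1024)

candidates = {
    "md5": lambda: hashlib.md5(payload).digest(),
    "adler32": lambda: zlib.adler32(payload),
    "crc32": lambda: zlib.crc32(payload),
}


def benchmark(repeat=20):
    """Return the average seconds per call for each candidate."""
    return {
        name: timeit.timeit(func, number=repeat) / repeat
        for name, func in candidates.items()
    }


if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print(f"{name:8s} {seconds:.6f}")
```

Timings will vary with the machine and the Python build, so the results are only meaningful relative to each other.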

Harvey - 4th November 2005
If you use CRC32 then you can also include the contents of zip files by using the CRC value stored in infolist() instead of having to read the file from the zip and computing the CRC.
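
A small sketch of this idea in Python 3, assuming the standard zipfile module. The helper names file_crc32 and zip_crcs are hypothetical. Each entry returned by infolist() carries a CRC attribute read from the archive's directory, so the member itself never has to be extracted.

```python
import zipfile
import zlib


def file_crc32(path, chunk_size=64 * 1024):
    """CRC32 of a file on disk, computed chunk by chunk."""
    value = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.crc32(chunk, value)
    return value & 0xFFFFFFFF


def zip_crcs(zip_path):
    """Map each member name to the CRC32 stored in the archive's directory."""
    with zipfile.ZipFile(zip_path) as archive:
        # ZipInfo.CRC comes from the archive metadata; nothing is decompressed.
        return {info.filename: info.CRC for info in archive.infolist()}
```

A file on disk is then a candidate match for a zip member when file_crc32(path) equals the stored value for that member.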

cableguy - 28th October 2005
You should open the file in binary reading mode. Use file(name, 'rb').

Myers Carpenter - 28th October 2005
Just use filecmp.cmp().
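
A minimal sketch of that approach, with a hypothetical wrapper name same_content. One detail worth knowing: by default filecmp.cmp() does a shallow comparison and treats two files as equal when their os.stat() signatures (type, size, modification time) match, without reading them. Passing shallow=False makes it compare the contents.

```python
import filecmp


def same_content(path_a, path_b):
    """True if both files hold identical bytes."""
    # shallow=False skips the stat-signature shortcut and compares contents.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

This suits comparing two known files directly; a digest per file is more convenient when many files have to be matched against each other.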

ferringb - 28th October 2005
Check out the fchksum module. There is a definite difference in speed, plus it doesn't require buffering the whole contents of the file in memory.

Florian - 29th October 2005
Could you please post a link to it where a hapless victim can download ready-to-install packages for Python 2.3/2.4 for Mac OS X, Linux and Windows?

A performance comparison would be nice too (including md5, adler32, crc32 and fchksum).

Peter Bengtsson - 30th October 2005
Try www.python.org or www.activestate.com

Florian - 31st October 2005
Of course I mean fchksum; it's part of neither python.org's Python nor ActiveState's.

Peter Bengtsson - 30th October 2005
Thanks for the advice but I can't afford the time to test this more. The next time I write a benchmark I'll include this.
 