Best non-cryptographic hashing function in Python (size and speed)

21 February 2015   7 comments   Python

First of all: hashing is hard. But fortunately it gets a little bit easier if it doesn't have to be cryptographic. A non-cryptographic hashing function is basically something that takes a string and converts it to another string in a predictable fashion, and it tries to do so with as few clashes as possible and as fast as possible.

MD5 is commonly used as a non-cryptographic hashing function these days. Unlike sha256 or sha512 it's simpler, faster and its output is shorter.

Now, how do you make a hashing function that yields a string that is as short as possible? The simple answer is to make the output use as many different characters as possible. If a hashing function only returns digits you only have 10 possible values per character. If you instead use a-z, A-Z and 0-9 you have 26 + 26 + 10 = 62 possible values per character.

Hex on the other hand only uses 0-9 and a-f, which is just 10 + 6 = 16 possible values per character. So you need a longer string to be reasonably sure it's unique and won't clash with another hash output. Git for example uses a 40 character long hex string to represent a commit. GitHub uses an abbreviated version of only 7 characters in some of its web UI, which it gets away with because things are usually in the context of a repo name or something like that. For example github.com/peterbe/django-peterbecom/commit/462ae0c
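To make this concrete, here's a quick sketch (in Python 3 syntax) comparing how long the same 128-bit MD5 digest ends up when rendered as hex versus base64:

```python
import base64
import hashlib

digest = hashlib.md5(b'peterbengtsson').digest()  # 16 raw bytes (128 bits)
print(len(digest.hex()))              # 32 hex characters (4 bits each)
print(len(base64.b64encode(digest)))  # 24 base64 characters (6 bits each, incl. '==' padding)
```

Same information, fewer characters, simply because each base64 character carries more bits.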

So, what other choices do you have when it comes to returning a hash output that is sufficiently long that it's "almost guaranteed" to be unique but sufficiently short that it becomes practical in terms of storage space? I have an app for example that turns URLs into unique IDs because they're shorter that way and more space efficient to store as values in a big database. One such solution is to use a base64 encoding.

Base64 uses a-z, A-Z and 0-9 (plus + and /) but you'll notice it doesn't have the "hashing" nature at all; it's just a direct translation, character by character. E.g.

>>> base64.encodestring('peterbengtsson')
'cGV0ZXJiZW5ndHNzb24=\n'
>>> base64.encodestring('peterbengtsson2')
'cGV0ZXJiZW5ndHNzb24y\n'

I.e. these two strings are different, but if you were to take only the first 10 characters they would be the same. In other words, here's a terrible hashing function:

import base64

def hasher(s):  # this is not a good hashing function
    return base64.encodestring(s)[:10]
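For the record, here's that terrible hasher in Python 3 syntax (where base64.encodestring is gone), demonstrating exactly the kind of clash we're worried about:

```python
import base64

def hasher(s):  # still not a good hashing function
    # base64 is a direct encoding, so inputs that share a prefix
    # produce "hashes" that share a prefix too
    return base64.b64encode(s.encode('utf-8')).decode('ascii')[:10]

print(hasher('peterbengtsson'))   # cGV0ZXJiZW
print(hasher('peterbengtsson2'))  # cGV0ZXJiZW  <- clash!
```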

So, what we want is a hashing function that returns output that is short and very rarely clashing and does this as fast as possible.

To test this I wrote a script that tried a bunch of different ad-hoc hashing functions. I generated a list of 130,000+ different words with an average length of 15 characters. Then I looped over these words, checking whether any hashed output was ever repeated. For each function I took the time it takes to generate the 130,000+ hashes and multiplied that by the total number of kilobytes of output. For example, if each hash output is 9 characters long that's (130000 * 9) / 1024 ~= 1142 Kb. And if it took 0.25 seconds to generate all of them, the combined score is 1142 * 0.25 ~= 286 kilobyte-seconds.
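Roughly, the scoring loop looked something like this sketch (not the original script; the word list and the candidate function here are stand-ins):

```python
import hashlib
import time

def score(hash_func, words):
    """Return (seconds, kilobytes, clash_free) for hashing every word."""
    start = time.time()
    outputs = [hash_func(w) for w in words]
    elapsed = time.time() - start
    clash_free = len(set(outputs)) == len(outputs)
    kilobytes = sum(len(o) for o in outputs) / 1024.0
    return elapsed, kilobytes, clash_free

words = ['word%d' % i for i in range(1000)]  # stand-in for the real word list
h11 = lambda w: hashlib.md5(w.encode('utf-8')).hexdigest()[:9]
elapsed, kilobytes, clash_free = score(h11, words)
print('%.3fs  %.1f Kb  %.2f kbs' % (elapsed, kilobytes, elapsed * kilobytes))
```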

Anyway, here are the results:

h11 100.00  0.217s  1184.4 Kb   257.52 kbs
h6  100.00  1.015s  789.6 Kb    801.52 kbs
h10 100.00  1.096s  789.6 Kb    865.75 kbs
h1  100.00  0.215s  4211.2 Kb   903.46 kbs
h4  100.00  1.017s  921.2 Kb    936.59 kbs

(kbs means "kilobyte-seconds")

These are the functions that returned 0 clashes amongst 134,758 unique words. There were others too that I'm not bothering to include because they had clashes. So let's look at these functions:

import hashlib

def h11(w):
    return hashlib.md5(w).hexdigest()[:9]

def h6(w):
    h = hashlib.md5(w)
    return h.digest().encode('base64')[:6]

def h10(w):
    h = hashlib.sha256(w)
    return h.digest().encode('base64')[:6]

def h1(w):
    return hashlib.md5(w).hexdigest()

def h4(w):
    h = hashlib.md5(w)
    return h.digest().encode('base64')[:7]
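These were written for Python 2, where str is a byte string and bytes have .encode('base64'). A Python 3 sketch of, say, h6() might look like:

```python
import base64
import hashlib

def h6_py3(w):
    # Python 3 take on h6() above: base64-encode the raw MD5 digest
    # and keep the first 6 characters (6 * 6 = 36 bits)
    digest = hashlib.md5(w.encode('utf-8')).digest()
    return base64.b64encode(digest).decode('ascii')[:6]

print(h6_py3('peterbengtsson'))  # a stable 6-character string
```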

It's kind of arbitrary to say the "best" one is the one with the smallest time multiplied by size. Perhaps the size matters more to you; in that case the h6() function is better because it returns 6 character strings instead of the 9 character strings of h11().

I'm apprehensive about publishing this blog post because I bet I'm doing this entirely wrong. Perhaps there are better ways to get a hash digest as a short string that doesn't need to be base64 encoded. I just haven't found any in the standard library yet.

Comments

Sybren

0-9 are 10 possibilities, not 9 ;-)

Peter Bengtsson

Right you are!

Haxman

Hi! This page is currently the 6th Google result for [python fast hash]! Unfortunately, the analysis here is somewhat confused, which could lead beginners astray.

MD5 would be better described as a flawed cryptographic hash function, rather than a "non-cryptographic hash function". (MD5 tried to be cryptographically strong, and was used in that fashion for quite a while, but turned out to be broken. A truly designed-to-be non-cryptographic hash function would be simpler and faster; more on that later.)

To meaningfully compare algorithms, you should do so based on their raw output (bits/bytes), not multiple varying printable representations (hex/base64, possibly truncated). Those encoding/truncation decisions are logically separate, and can be done at the end, to any suitable algorithm.

Against a random set of inputs like your "130,000+" words, taking the same number of leading bits from either MD5 or SHA256 will be equally likely to be free of collisions. (The known flaws in MD5 only come into play if inputs are chosen adversarially to engineer collisions.)

MD5 will be faster for the same amount of input, since it is a simpler algorithm with smaller state and output. (Your timing scores based on output-characters, as modified by a further encoding/truncation choice, are a bit fishy. It isn't typical to rate a hash's speed by output characters, and your times may be dominated by other choices in your loop.)

Given your test – 130,000 (around 2^17) random words – there's a "birthday problem" probability calculation that predicts how many unique, randomly-distributed output values any hash function would need to offer to make even one collision among those 2^17 probes unlikely: around 2^34. So it's no surprise that each of your shortest surviving hashes encodes 36 bits, with just the right number of characters to get over 34 bits. (Hex characters, with 16 values, encode 4 bits each, so 8 characters (8*4=32 bits) would be too few, while 9 characters (9*4=36 bits) is just enough. Similarly base64 characters, with 64 values, encode 6 bits each, so 5 characters (5*6=30 bits) would be too few, while 6 characters (6*6=36 bits) is just enough.)
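That birthday-problem arithmetic can be sketched with the standard approximation p ≈ 1 − e^(−n²/2N), where n is the number of inputs and N the number of possible outputs:

```python
import math

def collision_probability(n_items, n_bits):
    # Birthday-problem approximation: p ~= 1 - exp(-n^2 / (2 * 2^n_bits))
    space = 2.0 ** n_bits
    return 1.0 - math.exp(-(n_items ** 2) / (2.0 * space))

n = 134758  # roughly 2^17 words
print('%.2f' % collision_probability(n, 32))  # ~0.88: 32 bits would likely clash
print('%.2f' % collision_probability(n, 36))  # ~0.12: 36 bits probably survives
```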

So what technique should be used to create the sort of usually-unique, stable, short IDs you want?

If you need to stick with the Python standard library, starting with MD5 is a defensible choice. Then if compactness in text representation matters, represent the hashes in base64. Then if you want to accept more collision-risk than is likely with the full 128 bits, use the birthday-problem calculations to pick a number of retained bits that fit your size/collision tradeoffs, and truncate away the rest.

But remember MD5 is slower than hash functions that never intended to have cryptographic collision-resistance as a goal. So if speed is a goal, you could look outside the standard library. If you'd like to use the fast non-cryptographic hash functions that Google has published, look into their projects "CityHash" or "FarmHash". (These even have 256-bit options, on the off chance that the collision-risk with 128-bits is still too high for your application.) The first 36 (or 64 or 80 or 128) bits from these should be just as good as the bits from MD5, for your purposes. Then, apply the same encoding and truncation choices as may be appropriate.

Peter Bengtsson

Thanks for that insightful comment. Really appreciate it!

In my Autocompeter.com service (written in Go) I actually went for version h6(). I needed it to be as small as possible and that one produces, very quickly, a hash that is only 6 characters long and seems very safe for clashes.

Karel Vervaeke

Hi. Your article was useful to me, but you missed a teachable moment by omitting the 'bad' hashes. See https://en.wikipedia.org/wiki/Survivorship_bias :)

erik

Python has a built-in hash() function that's faster than all of these.

Dan Girellini

The built-in hash function will not always return the same results across different invocations of your program. Check out the note at the bottom of the hash documentation about PYTHONHASHSEED: https://docs.python.org/3/reference/datamodel.html#object.__hash__
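To illustrate the point, here's a sketch that runs hash() in fresh interpreters with PYTHONHASHSEED pinned explicitly, showing that the result depends on the seed:

```python
import os
import subprocess
import sys

def seeded_hash(seed):
    # Evaluate hash('peterbengtsson') in a fresh interpreter with a
    # fixed PYTHONHASHSEED; the result depends on the seed
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, '-c', "print(hash('peterbengtsson'))"], env=env)
    return int(out)

print(seeded_hash(1) == seeded_hash(1))  # True: same seed, same hash
print(seeded_hash(1) == seeded_hash(2))  # False: different seed, different hash
```

Without a fixed seed, each process picks a random one, so hash() output is unsuitable as a stable ID across runs.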
