21 February 2015, Python
First of all; hashing is hard. But fortunately it gets a little bit easier if it doesn't have to be cryptographic. A non-cryptographic hashing function is basically something that takes a string and converts it to another string in a predictable fashion, and it tries to do so with as few clashes as possible and as fast as possible.
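MD5 is, for practical purposes, a non-cryptographic hashing function. It was designed as a cryptographic one, but unlike sha256 or sha512 it's no longer considered secure because collisions can be constructed deliberately.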
Now, how do you make a hashing function that yields a string that is as short as possible? The simple answer is to make the output use as many different characters as possible. If a hashing function only returns digits you only have 10 possible values per character. If you instead use a-z, A-Z and 0-9 you now have 26 + 26 + 10 = 62 possible values per character.
Hex, on the other hand, only uses 0-9 and a-f, which is only 10 + 6 = 16 possible values per character. So you need a longer string to be sure it's unique and can't clash with another hash output. Git, for example, uses a 40 character long hex string to represent a git commit. GitHub uses an abbreviated version of only 7 characters in some of its web UI, which they get away with because things are often in the context of a repo name or something like that. For example github.com/peterbe/django-peterbecom/commit/462ae0c
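To make the effect of alphabet size concrete, here is a rough sketch in Python 3. The function name chars_needed is made up for this illustration. It counts how many characters you need before the alphabet can cover a given number of bits; 160 bits is what a 40 character hex string holds.

```python
def chars_needed(alphabet_size, bits):
    """Smallest length such that alphabet_size ** length >= 2 ** bits."""
    length = 0
    capacity = 1  # number of distinct strings of the current length
    while capacity < 2 ** bits:
        capacity *= alphabet_size
        length += 1
    return length


for name, size in [("digits only", 10), ("hex", 16), ("a-zA-Z0-9", 62)]:
    print(name, chars_needed(size, 160))
```

With 160 bits, hex needs the familiar 40 characters, while a 62 character alphabet gets away with 27.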
So, what other choices do you have when it comes to returning a hash output that is sufficiently long that it's "almost guaranteed" to be unique but sufficiently short that it becomes practical in terms of storage space? I have an app for example that turns URLs into unique IDs because they're shorter that way and more space efficient to store as values in a big database. One such solution is to use a base64 encoding.
Base64 uses a-z, A-Z and 0-9 plus the two characters + and /, but you'll notice it doesn't have the "hashing" nature in that it's just a direct translation of the input. E.g.
    >>> import base64
    >>> base64.encodestring('peterbengtsson')
    'cGV0ZXJiZW5ndHNzb24=\n'
    >>> base64.encodestring('peterbengtsson2')
    'cGV0ZXJiZW5ndHNzb24y\n'
I.e. these two strings are different, but if you were to take only the first 10 characters of each output they would be the same. Basically, here's a terrible hashing function:
    def hasher(s):
        # this is not a good hashing function
        return base64.encodestring(s)[:10]
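The snippets above are Python 2. As a minimal sketch of the same point in Python 3 (assuming UTF-8 input and using base64.b64encode in place of encodestring), the clash is easy to reproduce:

```python
import base64


def hasher(s):
    # Same "terrible" idea as above: base64-encode and truncate.
    # This is NOT a real hashing function.
    return base64.b64encode(s.encode("utf-8"))[:10].decode("ascii")


a = hasher("peterbengtsson")
b = hasher("peterbengtsson2")
print(a, b, a == b)
```

The first 10 base64 characters only depend on the first 8 bytes of the input, so any two strings that share an 8 byte prefix will clash.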
So, what we want is a hashing function that returns output that is short and very rarely clashing and does this as fast as possible.
To test this I wrote a script that tried a bunch of different ad-hoc hashing functions. I generate a list of 130,000+ different words with an average length of 15 characters. Then I loop over these words until a hashed output is repeated for a second time. For each function, I take the time it takes to generate the 130,000+ hashes and multiply that by the total size of the output in kilobytes. For example, if the hash output is 9 characters each in length that's (130000 * 9) / 1024 ~= 1142 Kb. And if it took 0.25 seconds to generate all of those the combined score is 1142 * 0.25 ~= 286 kilobyte seconds.
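The measurement described above could be sketched roughly like this in Python 3. This is not the original script; the helper names first_clash, score and md5_hex9, and the stand-in word list, are assumptions made for illustration.

```python
import hashlib
import time


def first_clash(hash_func, words):
    """Return the index of the first word whose hash was already seen, or None."""
    seen = set()
    for i, word in enumerate(words):
        h = hash_func(word)
        if h in seen:
            return i
        seen.add(h)
    return None


def score(hash_func, words):
    """Return (seconds, kilobytes, seconds * kilobytes) for hashing all words."""
    t0 = time.perf_counter()
    outputs = [hash_func(w) for w in words]
    seconds = time.perf_counter() - t0
    kilobytes = sum(len(o) for o in outputs) / 1024
    return seconds, kilobytes, seconds * kilobytes


def md5_hex9(word):
    # Example candidate: the first 9 hex characters of an MD5 digest.
    return hashlib.md5(word.encode("utf-8")).hexdigest()[:9]


words = ["word%d" % i for i in range(1000)]  # stand-in for a real word list
print("first clash at:", first_clash(md5_hex9, words))
print("seconds, Kb, score: %.4f %.1f %.4f" % score(md5_hex9, words))
```

A lower score is better: it rewards functions that are both fast and produce short output.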
Anyway, here are the results:
    h11   100.00   0.217s   1184.4 Kb   257.52 kbs
    h6    100.00   1.015s    789.6 Kb   801.52 kbs
    h10   100.00   1.096s    789.6 Kb   865.75 kbs
    h1    100.00   0.215s   4211.2 Kb   903.46 kbs
    h4    100.00   1.017s    921.2 Kb   936.59 kbs
(kbs means "kilobytes seconds")
These are the functions that returned 0 clashes amongst 134,758 unique words. There were others too that I'm not bothering to include because they had clashes. So let's look at these functions:
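    import hashlib

    def h11(w):
        # first 9 hex characters of the MD5 digest
        return hashlib.md5(w).hexdigest()[:9]

    def h6(w):
        # first 6 base64 characters of the MD5 digest
        h = hashlib.md5(w)
        return h.digest().encode('base64')[:6]

    def h10(w):
        # first 6 base64 characters of the SHA-256 digest
        h = hashlib.sha256(w)
        return h.digest().encode('base64')[:6]

    def h1(w):
        # the full 32 character MD5 hex digest
        return hashlib.md5(w).hexdigest()

    def h4(w):
        # first 7 base64 characters of the MD5 digest
        h = hashlib.md5(w)
        return h.digest().encode('base64')[:7]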
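These functions rely on Python 2's 'base64' string codec. A possible Python 3 translation, assuming the words are text that gets encoded as UTF-8 first, could look like the sketch below. The timings above were not measured with this version.

```python
import base64
import hashlib


def _b64_prefix(digest, length):
    # base64.b64encode returns bytes in Python 3, so decode back to str.
    return base64.b64encode(digest)[:length].decode("ascii")


def h11(w):
    return hashlib.md5(w.encode("utf-8")).hexdigest()[:9]


def h6(w):
    return _b64_prefix(hashlib.md5(w.encode("utf-8")).digest(), 6)


def h10(w):
    return _b64_prefix(hashlib.sha256(w.encode("utf-8")).digest(), 6)


def h1(w):
    return hashlib.md5(w.encode("utf-8")).hexdigest()


def h4(w):
    return _b64_prefix(hashlib.md5(w.encode("utf-8")).digest(), 7)


for func in (h11, h6, h10, h1, h4):
    print(func.__name__, func("peterbengtsson"))
```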
It's kinda arbitrary to say the "best" one is the one that takes the shortest time multiplied by size. Perhaps the size matters more to you, in which case the h6() function is better because it returns 6 character strings instead of the 9 character strings from h11().
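On the size question, it may help to estimate how likely a clash is in the first place. The sketch below uses the standard birthday approximation and assumes the hash output is uniformly distributed, which is an idealisation; the function name clash_probability is made up for this example.

```python
import math


def clash_probability(n_items, alphabet_size, length):
    """Birthday approximation: P(at least one clash) ~= 1 - exp(-n(n-1) / 2N)."""
    space = alphabet_size ** length  # N, the number of possible outputs
    return 1 - math.exp(-n_items * (n_items - 1) / (2 * space))


n = 134758  # number of unique words in the test above
print("9 hex chars:    %.4f" % clash_probability(n, 16, 9))
print("6 base64 chars: %.4f" % clash_probability(n, 64, 6))
print("7 base64 chars: %.4f" % clash_probability(n, 64, 7))
```

Note that 9 hex characters and 6 base64 characters both carry 36 bits, so under this assumption they have the same clash risk, roughly 12% for this many words. Seeing 0 clashes is the likely outcome but not a guaranteed one, and one extra base64 character brings the estimate down to well under 1%.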
I'm apprehensive about publishing this blog post because I bet I'm doing this entirely wrong. Perhaps there are better ways to digest a hashing function that returns strings that don't need to be base64 encoded. I just haven't found any in the standard library yet.