Re: FYI performance of hash algorithms [message #177853 is a reply to message #177852]
Mon, 23 April 2012 20:16
Jerry Stuckle
On 4/23/2012 3:42 PM, M. Strobel wrote:
> I am just asking myself which of the many hash algorithms performs best, so I can use
> it to index my translation strings.
>
> So I wrote a test script. It iterates over an array of 3,917,116 strings, calculating
> the hash of each.
>
> Here you have the results, with run time, and run time relative to hash length
> (smaller is better).
> Interesting that the tiger hashes seem to perform well.
>
> There were only collisions with adler32 (many), crc32 and crc32b.
>
> /Str.
>
<snip>
Interesting, but only applicable to your system, with your libraries and
your data. A different system could have different versions of the
libraries and completely different performance figures.
As for collisions - those are completely dependent on your data. While you can
say some things with certainty (e.g. hashing 2^16+1 distinct values with CRC-16
is guaranteed to produce at least one collision, by the pigeonhole principle),
picking a different set of strings to hash will give you a different set of
collisions.
For instance, it's not really surprising that adler32 had a lot of collisions.
It was designed as a checksum - a faster alternative to CRC-32 - and works
better with binary data; it was never meant to be a general-purpose hashing
algorithm. Rather, it was designed for speed. Its output is also very short
(32 bits).
The same is true, to a lesser extent, of the CRC-32 and CRC-32b algorithms, so
again I'm not surprised about the collisions there either.
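Collision counts are easy enough to check against your own data, by the way.
Something along these lines would do it - just a rough, untested sketch; the
file name and the choice of crc32b are placeholders, not Strobel's actual
script:

<?php
// Rough sketch: count collisions for one algorithm over your own strings.
// 'strings.txt' and 'crc32b' are placeholders -- substitute your translation
// strings and whatever algorithm you want to test (see hash_algos()).
$strings = file('strings.txt', FILE_IGNORE_NEW_LINES);
$seen = array();
$collisions = 0;
foreach ($strings as $s) {
    $h = hash('crc32b', $s);
    if (isset($seen[$h]) && $seen[$h] !== $s) {
        $collisions++;          // two different strings, same hash value
    } else {
        $seen[$h] = $s;         // first string seen with this hash value
    }
}
printf("%d collisions out of %d strings\n", $collisions, count($strings));

Run that with a few different algorithms from hash_algos() and compare the
counts - the results on your own translation strings are the only ones that
mean anything for your use case.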
But the real question here is - how much does it matter if hashing a
string takes 2 microseconds or 2.5 microseconds? Unless you're doing
millions of them at one time, no one is going to notice the difference.
Personally, when looking at such things, I always choose the fewest collisions
over the fastest time.
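And if you do want to quantify the speed side of it, time it on your own box
with your own strings instead of relying on anyone else's figures. Something
like this - again only a sketch, and the file name and algorithm list are just
examples:

<?php
// Rough timing sketch: same idea as the original test script, run locally.
// 'strings.txt' is a placeholder; pick algorithms from hash_algos().
$strings = file('strings.txt', FILE_IGNORE_NEW_LINES);
foreach (array('adler32', 'crc32b', 'md5', 'sha1', 'tiger160,3') as $algo) {
    $start = microtime(true);
    foreach ($strings as $s) {
        hash($algo, $s);
    }
    $elapsed = microtime(true) - $start;
    printf("%-12s %8.3f s total, %6.2f us/string\n",
           $algo, $elapsed, 1e6 * $elapsed / count($strings));
}

The per-string figures are what the 2 vs. 2.5 microsecond comparison above is
about - differences that small rarely matter next to the collision counts.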
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex(at)attglobal(dot)net
==================