Subject: Re: more Q&D results [hash for use in the kernel]
To: None <tech-kern@netbsd.org, tech-perform@netbsd.org>
From: Simon Burge <simonb@wasabisystems.com>
List: tech-kern
Date: 11/30/2001 02:15:06
"David Laight" wrote:

> How good are these hash functions on non-random binary data?
> Hashing random data is (should be) trivial.

I've done some more playing, and generated some filehandles (look at
getfh(2) for a handy syscall).  Here's the outputs, sorted by time per
hash:

	> ./hashtest b 262144 28 < fhlist.245
	total items:  141431
	hash         used col/fr    max    avg  s/dev  hash time
	torek       82948  48632     53   1.71   1.24  2.9861 us
	dumb          538      3    563 262.88 166.49  3.7416 us
	bob        104911  78337     54   1.35   0.91  3.8239 us
	gawk       105737  79303     54   1.34   0.91  4.1064 us
	lennart     56030  16700     54   2.52   1.71  4.4145 us
	perlxor     87362  53325     53   1.62   1.15  4.5534 us
	perl        95307  62480     54   1.48   1.01  4.5536 us
	byacc       84908  51442     53   1.67   1.21  4.5555 us
	bernstein   95312  62484     54   1.48   1.01  4.5828 us
	tcl         62361  30442     53   2.27   1.88  4.7169 us
	crc         45092  13904     54   3.14   2.59  5.3558 us
	mouse        1604     62    225  88.17  70.49  5.3584 us
	python     105190  78668     55   1.34   0.91  5.3858 us
	pjw         12346   1615     67  11.46  11.30  5.8344 us
	honeyman    84764  42034     54   1.67   1.07  6.1975 us
	fnv        103685  76333     54   1.36   0.93  8.4586 us

	> cat flist.[245] | ./hashtest s 262144
	total items:  141592
	hash         used col/fr    max    avg  s/dev  hash time
	torek      109576  82751      6   1.29   0.56 10.7280 us
	bob        109438  82526      7   1.29   0.56 11.9131 us
	dumb         9270    907     69  15.27  12.06 12.5749 us
	gawk       109509  82703      6   1.29   0.56 13.0312 us
	lennart     57976  16236     10   2.44   1.32 14.0417 us
	perlxor    109434  82583      7   1.29   0.56 14.1793 us
	perl       109456  82614      7   1.29   0.56 14.1799 us
	byacc      109472  82622      6   1.29   0.56 14.1801 us
	tcl        108727  81513      7   1.30   0.58 14.1818 us
	bernstein  109425  82572      7   1.29   0.56 14.2112 us
	crc         99124  72839    156   1.43   1.57 15.7868 us
	mouse      106222  78926     11   1.33   0.66 15.7880 us
	python     109373  82461      7   1.29   0.57 15.8151 us
	pjw         86060  57144     90   1.65   1.43 17.1417 us
	honeyman   102331  74604    129   1.38   1.18 17.4236 us
	fnv        109542  82788      6   1.29   0.56 22.0791 us

Also interesting is the size of the functions (not including any data
needed by "lennart" and "honeyman") in bytes for a few architectures.

	function	sparc	alpha	ns32k
	bernstein          76	   68	   52
	bobhash           708	  940	  539
	byacchash          68	   56	   46
	crchash            80	   88	   61
	dumbhash           52	   48	   38
	fnv               100	   92	   47
	gawkhash          332	  396	  290
	honeyman           92	   88	   64
	lennart            72	  132	   66
	mousehash          80	   92	   71
	perlhash           72	   68	   52
	perlxorhash        72	   72	   52
	pjwhash            84	  100	   69
	pythonhash         92	  116	   66
	tclhash            68	   52	   42
	torekhash         268	  300	  210

"bob", perhaps the best function in terms of both speed and spread, is
also by far the largest.  Is it's size a concern from a Icache polution
POV?

"torek" is the quickest, but is only good and not very good on binary
data.  It does however have the obfuscation advantage of using Duff's
device :-)

So, I'm wondering if outright speed or hash distrubtion is the main
criteria for selecting an algorithm?  Leaders appear to be:

	size doesn't matter:	bob
	speed important:	torek
	distrubtion important:	gawk

Simon.
--
Simon Burge                            <simonb@wasabisystems.com>
NetBSD CDs, Support and Service:    http://www.wasabisystems.com/