Subject: Re: New hash algorithm - FNV - for use in the kernel
To: None <tech-perform@netbsd.org, tech-kern@netbsd.org>
From: Simon Burge <simonb@wasabisystems.com>
List: tech-kern
Date: 11/28/2001 17:13:49
der Mouse wrote:

> If you want something CRCish, I'd suggest something like
> 
> unsigned int hash(const void *data, int len)
> {
>  const unsigned char *dp;
>  unsigned int v;
> 
>  v = ~0;
>  for (dp=data;len>0;dp++,len--)
>   { v ^= *dp;
>     v = (v >> 1) ^ ((v & 1) ? 0xedb88320 : 0);
>   }
>  return(v);
> }

Adding this as "mouse" gets for filenames:

	total:  92378
	dumb:    used:  7201 max: 63
	fnv:     used: 49709 max:  9
	lennart: used: 49508 max:  8
	crc:     used: 46967 max: 80
	perl:    used: 49413 max:  9
	mouse:   used: 49137 max:  9

and binary data:

	total:  1048576
	dumb:    used:  3167 max: 1076
	fnv:     used: 65536 max:   35
	lennart: used: 65536 max: 4132
	crc:     used: 65536 max:   35
	perl:    used: 65536 max:   36
	mouse:   used: 65536 max:   35

> > 	crc:
> > 	  used: 46967
> > 	  max:  80
> 
> Very interesting.  The buckets-used value is good, but there's that
> high max.  A histogram of the bucket loads would be intersting, and if
> there really is just one peak, it would be interesting to analyze the
> similarities among the strings that hit it.

Without analysis, >99% of entries in the string case were < 10, with the
last couple (sorted numerically) being:

	 3181:     26
	13721:     26
	41076:     26
	16296:     27
	11178:     31
	24223:     31
	39751:     32
	21170:     77
	18108:     80

Similarly, the "lennart" hash with binary data finishes off with

	  252:     42
	    3:     43
	  195:     44
	   76:     46
	    0:   4132

If I change it so that it doesn't stop on a zero byte (!), the figure
look a bit better, but not great:

	lennart: used: 30316 max:  212  

with the largest buckets being:

	35668:    172
	45852:    172
	46032:    173
	35654:    174
	23575:    175
	46020:    186
	60616:    186
	45824:    193
	45969:    208
	35681:    212

> > From these two test cases, both FNV and perl look to be best.
> 
> Agreed, though I'd be interested to see similar results for, say, a
> data set consisting of several thousand unique filehandles sniffed from
> NFS traffic, since it seems that's one of the major applications.  (At
> least, it's the one that got us onto this thread. :-)

I was lazy.  If you want to shoot several thousand filehandles to
me, feel free :-)

Simon.
--
Simon Burge                            <simonb@wasabisystems.com>
NetBSD CDs, Support and Service:    http://www.wasabisystems.com/