Subject: Fwd: Re: New hash algorithm - FNV - for use in the kernel
To: None <tech-perform@netbsd.org, tech-kern@netbsd.org>
From: Seth Kurtzberg <seth@cql.com>
List: tech-perform
Date: 11/26/2001 23:06:39
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I think David's primary argument here, that hashing may not be the best
design choice, is well taken.

However, on the subject of hash table sizes, it is not difficult to allow the
hash table to dynamically increase in size.  This does not impose significant
overhead; you can simply count the length of the hash buckets as you traverse
them, and trigger a dynamic resize when some threshhold is exceeded.

There would be some issues in doing so, such as memory fragmentation, but it
can be done.

Here, of course, you are trading off average case performance (which tends to
be better with hashing) against worst case performance (which would clearly
be dismal at the point where the has is dynamically reconstructed).  So, it
depends somewhat on whether the average case or the worst case is more
important; an obvious example is that you wouldn't do this sort of thing if
you are trying to get near real time functionality.

On Tuesday 27 November 2001 02:21, you wrote:
> > There is a potential issue however; FNV uses multiply (*) and xor (^),
> > whereas a lot of the existing hashes on strings and buffers just use
> > addition (+).
>
> Using multiply will (probably) stuff the performance of sytems that
> don't have fast hardware multiply (eg sparc32).
>
> > 1) vmstat -h nfsnodehash:
> >                     total     used     util      num  average  maximum
> > hash table        buckets  buckets        %    items    chain    chain
> > ----------------------------------------------------------------------
> >
> >     with existing nfs_hash():
> > nfsnodehash         65536     1000     1.53    34688       34      105
>
> The 1000 used buckets looks too much like a non-random number!
> Anyone know whether this is true?
>
> >     using fnv_32_buf()
> > nfsnodehash         65536    27054    41.28    34867        1        6
>
> One problem, I have with these kernel hash tables, is that you end up
> allocating a massive table - and require a good (computationally expensive)
> hash function - to get good performance on large systems, wasting the space
> on smaller systems.  Or generate a kernel configuration constant that is
> never tuned for the (current) workload of the system.
>
> It often isn't viable to make the hash table (and function) such that the
> load factor is < 1 - so there are almost always chains.  Leaving your
> lookup O(n) for large data sets.
>
> Surely it would be better to find an ordering relation and put the data
> into a balanced binary tree.  For large data sets this is O(log n).
>
> Of course it is probably worth having a cache of recently acessed items
> - but this can be much smaller and only needs a cheap hash function.
>
>     David

- - --
Seth Kurtzberg
Machine Independent Software
Office:  (480) 661-1849
Fax: (480) 614-8909
email:  seth@cql.com
pager:  888-605-9296 or email 6059296@skytel.com

- -- 
Seth Kurtzberg
Machine Independent Software
Office:  (480) 661-1849
Fax: (480) 614-8909
email:  seth@cql.com
pager:  888-605-9296 or email 6059296@skytel.com

-----BEGIN PGP SIGNATURE-----
Version: PGP 6.5.8

iQA/AwUBPAMtb3hkmRgYZUCaEQK4RQCbBw2P6szdPnEKoXAA3+TNOEvXXwYAoLlD
MpaU6/dKfkfOTzxwZL1+O3jz
=MUBm
-----END PGP SIGNATURE-----