Subject: more Q&D results [hash for use in the kernel]
To: None <tech-kern@netbsd.org, tech-perform@netbsd.org>
From: ozan s. yigit <oz@zonzorp.canada.sun.com>
List: tech-perform
Date: 11/28/2001 13:55:30
i include two sets of results, consider them preliminary and also consider
reading a paper published long ago: Mckenzie et al. "Selecting a Hashing
Algorithm" [1]. the test harness that generates these results was written
for a follow-up paper i was once working on.  [i guess I should dust that
up and finish it] the important thing in these results is the Rn, which
can be described as the "closeness" to a perfect function for that
particular set of data. 1 would be perfect scatter but of course
we will be happy with anything that is very close. 

just for fun, these results are for all C indentifiers in X11R4
(i think; it was gathered a time ago) hashed to a) an 8K table, and
b) a prime table closest to 8K.

i added the "perl" hash function that was posted here; it is a variant
of the bernstein's k=33 hash, and the multiplication can be eliminated.
it seems to do better than FNV for anything i have so far tried. it can
be implemented with shifts and additions only, so does not need to pay
multiplication cost.

don't get hang-up in these results; i can tell you that bernstein and its
variants generally does well, along with few others. as i said before
FNV is nothing extra special, so far as i can tell from running these
tests since SP&E paper appeared over various types of string data.

more on this anon.

oz
---
[1]  B. J. McKenzie, R. Harries, T. Bell.
     Selecting a Hashing Algorithm
     Software - Practice and Experience
     20(2):209-224, Feb 1990.

ozan s. yigit			staff engineer, sun microsystems/es
http://www.cs.yorku.ca/~oz	ozan.yigit@sun.com || +1 [905] 415 2878
---
the pen is mightier than the sword if the sword is very short and the
pen is very sharp	-- Terry Pratchett

--- 8192 ---

X11ID: 16125 words
hashtable size 8192
W(W+N)/2N = 23932
                            unused  coll          avg     st     lookup
function 	           buckets  free    Rn   chain   dev    time
thinking machines/wais      13.87%  14.10%  1.339  2.3   1.408   0.168
knuth/TAOCP                 20.64%  14.83%  1.777  2.5   2.131   0.140
icon (k=1)                  71.78%   2.51%  4.173  7.0   4.306   0.136
bernstein                   14.37%  13.84%  1.347  2.3   1.424   0.160
phong vo (k=37)             13.68%  14.47%  1.335  2.3   1.400   0.136
larson/linear               13.98%  13.98%  1.338  2.3   1.405   0.146
oz/num (k=153)              14.23%  14.16%  1.344  2.3   1.419   0.161
bernstein (k=33)            13.53%  14.12%  1.326  2.3   1.380   0.161
phong vo                    14.47%  13.63%  1.339  2.3   1.407   0.168
K&R/torek (k=31)            14.55%  13.67%  1.345  2.3   1.421   0.132
corbett/byacc               13.83%  14.02%  1.334  2.3   1.396   0.097
pcc/cpp/c++ (k=2)           20.14%  14.00%  1.590  2.5   1.857   0.104
gnu emacs                   24.34%  14.24%  1.986  2.6   2.400   0.140
drdobbs/john boyer          14.47%  13.90%  1.344  2.3   1.418   0.237
danny sleator/parse         15.95%  13.64%  1.380  2.3   1.490   0.148
holzmann/protocols          16.21%  14.24%  1.413  2.3   1.554   0.116
perl                        13.22%  14.16%  1.322  2.3   1.372   0.132
rayan/ssl/zmailer           25.51%  14.77%  1.951  2.6   2.357   0.262
oz/sdbm (k=65587)           13.39%  13.98%  1.326  2.3   1.380   0.173
ken thompson (dbm)          14.12%  13.84%  1.338  2.3   1.405   0.213
weinberger/dragon book      33.44%  11.90%  2.072  3.0   2.502   0.142
mersenne                    13.71%  14.21%  1.334  2.3   1.396   0.280
pearson/cargill             14.84%  14.12%  1.357  2.3   1.444   0.096
knuth/graphbase (k=314159)  14.25%  13.68%  1.344  2.3   1.417   0.165
hanson/fraser/lcc           17.09%  14.32%  1.523  2.4   1.749   0.146
oz/sdbm (k=65599)           14.31%  13.85%  1.339  2.3   1.407   0.162
stu wecker/decus diff       14.28%  13.96%  1.340  2.3   1.409   0.130
gnu cpp (k=4)               36.21%  12.71%  3.203  3.1   3.588   0.146
gonnet/baeza-yates (k=131)  13.94%  13.79%  1.330  2.3   1.389   0.140
kim walden/depend           32.19%  15.60%  4.068  2.9   4.234   0.177
phong vo (k=129)            14.05%  14.20%  1.350  2.3   1.430   0.139
un*x refer                  36.90%   7.28%  1.936  3.1   2.339   0.286
stevens/db                  21.03%  13.62%  1.556  2.5   1.802   0.197
fowler/noll/vo              14.20%  14.21%  1.347  2.3   1.424   0.159
gcc front-end (k=613)       13.90%  14.13%  1.343  2.3   1.415   0.146
horspool/bsdbook (k=13753)  13.61%  13.74%  1.328  2.3   1.383   0.183
c++/chop                    51.32%   2.31%  2.069  4.0   2.499   0.113
numerical recipes (ranqd1)  13.70%  14.18%  1.337  2.3   1.404   0.168
honeyman/pathalias          20.28%  15.47%  4.381  2.5   4.445   0.189

--- 8209 ---

X11ID: 16125 words
hashtable size 8209
W(W+N)/2N = 23899
                            unused  coll          avg     st     lookup
function 	           buckets  free    Rn   chain   dev    time
thinking machines/wais      14.08%  13.82%  1.339  2.3   1.404   0.173
knuth/TAOCP                 14.28%  14.15%  1.344  2.3   1.416   0.135
icon (k=1)                  71.84%   2.51%  4.179  7.0   4.303   0.136
bernstein                   13.67%  14.10%  1.332  2.3   1.391   0.164
phong vo (k=37)             14.46%  13.49%  1.337  2.3   1.400   0.139
larson/linear               14.14%  14.00%  1.343  2.3   1.413   0.150
oz/num (k=153)              14.24%  13.98%  1.344  2.3   1.415   0.158
bernstein (k=33)            14.08%  14.02%  1.338  2.3   1.403   0.157
phong vo                    14.34%  13.69%  1.338  2.3   1.402   0.167
K&R/torek (k=31)            13.75%  14.26%  1.339  2.3   1.404   0.132
pcc/cpp/c++ (k=2)           17.10%  13.59%  1.414  2.4   1.554   0.098
gnu emacs                   14.39%  14.10%  1.346  2.3   1.419   0.130
drdobbs/john boyer          14.59%  13.84%  1.344  2.3   1.415   0.212
danny sleator/parse         14.35%  14.13%  1.345  2.3   1.417   0.145
holzmann/protocols          17.16%  13.99%  1.437  2.4   1.595   0.116
perl                        14.11%  14.07%  1.337  2.3   1.401   0.132
rayan/ssl/zmailer           14.42%  14.07%  1.349  2.3   1.426   0.255
oz/sdbm (k=65587)           13.74%  14.02%  1.331  2.3   1.387   0.182
ken thompson (dbm)          13.36%  14.27%  1.327  2.3   1.380   0.211
weinberger/dragon book      13.74%  14.28%  1.336  2.3   1.398   0.130
mersenne                    13.34%  14.07%  1.325  2.3   1.375   0.280
pearson/cargill             14.24%  13.98%  1.344  2.3   1.415   0.102
knuth/graphbase (k=314159)  14.01%  14.21%  1.336  2.3   1.398   0.168
hanson/fraser/lcc           14.37%  13.84%  1.342  2.3   1.412   0.161
oz/sdbm (k=65599)           13.90%  14.07%  1.339  2.3   1.405   0.163
stu wecker/decus diff       14.03%  14.18%  1.344  2.3   1.416   0.128
gnu cpp (k=4)               15.42%  14.31%  1.418  2.3   1.560   0.118
gonnet/baeza-yates (k=131)  13.86%  14.45%  1.338  2.3   1.404   0.140
kim walden/depend           25.65%  15.62%  2.492  2.6   2.948   0.149
phong vo (k=129)            13.57%  14.15%  1.330  2.3   1.386   0.136
un*x refer                  13.64%  14.20%  1.330  2.3   1.386   0.271
stevens/db                  20.70%  14.39%  1.559  2.5   1.805   0.201
fowler/noll/vo              14.00%  14.16%  1.342  2.3   1.411   0.159
gcc front-end (k=613)       13.83%  14.15%  1.334  2.3   1.394   0.143
horspool/bsdbook (k=13753)  14.00%  14.13%  1.340  2.3   1.408   0.187
c++/chop                    51.42%   2.31%  2.071  4.0   2.498   0.124
numerical recipes (ranqd1)  13.95%  13.99%  1.339  2.3   1.405   0.164
honeyman/pathalias          13.95%  14.01%  1.331  2.3   1.389   0.146

--- end ---