Subject: Re: RelCache (aka ELF prebinding) news
To: tech-kern@netbsd.org, tech-userlevel@netbsd.org
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-kern
Date: 12/03/2002 09:09:44
> I'm curious how you decided that the likelihood of a random collision
> was too great with a 32-bit sum.

I didn't make that decision, but as someone who thinks it's so, perhaps
I can explain.

Even under the most optimistic assumptions - that the algorithm
produces uniformly distributed random numbers in the statistical
sense, when run on the files in question - 32 bits just isn't enough.

It's the birthday paradox.  How many people do you need in a room
before it becomes a decent bet that two have the same birthday?
Roughly sqrt(365).  How many random numbers do you have to generate
from a selection of N possibilities before it becomes a decent bet that
you've got the same number twice?  Roughly sqrt(N).  (Working out
exactly when this chance reaches 1/2 is a mess.  Even if you're willing
to use Stirling's approximation to N!, it's still pretty ugly.  Loosely
put, though, the chance of a duplication is 1/N on the second item, 2/N
on the third, etc, up to (M-1)/N on the last, for a total of roughly
M(M-1)/(2N) - which gets up around 1/2 when M gets up near sqrt(N).
The precise chance that all M are different is N!/((N-M)! N^M), for
M<=N.)

With 32-bit numbers, N is 2^32, which makes sqrt(N) 2^16.  It's just
not that implausible that I've approached 64K distinct executables.
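
To put a number on it, here's a throwaway program (just a sketch of
the arithmetic above, nothing to do with the relcache code) that
computes the exact chance of at least one collision among 64K files
hashed into 2^32 buckets, granting the uniform-random behaviour I
assumed above:

#include <stdio.h>

/*
 * P(all M values distinct, drawn uniformly from N possibilities)
 * is the product of (1 - i/N) for i = 0 .. M-1.
 */
int
main(void)
{
	double n = 4294967296.0;	/* N = 2^32 */
	unsigned long m = 65536;	/* M = 64K distinct executables */
	double all_distinct = 1.0;
	unsigned long i;

	for (i = 0; i < m; i++)
		all_distinct *= 1.0 - (double)i / n;
	printf("P(at least one collision) = %.3f\n", 1.0 - all_distinct);
	return 0;
}

That prints about 0.39 - better than a one-in-three chance of a false
match, which is not what I'd call negligible.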

> However, there are significantly better 32-bit CRC polynomials than
> the usual one;

"Better" in what sense?  The usual goodness metric for CRC polynomials
is detecting certain kinds of errors.  Here, we're simply using it as a
hashing function.  But the argument above assumes that the hashing
function is, for our purposes, perfect: that it can be modeled as
generating a uniform random number that is somehow forever associated
with that particular file contents.

> (since the relcache can obviously use the file size to eliminate all
> other collisions).

I don't think I've seen this pointed out before; I'm fairly sure it
wasn't in any of the earlier descriptions I've read.  It helps, though
I don't think it helps enough for me to be content with 32 bits.
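
Concretely, I read that as making the lookup key something like this
(purely illustrative - I haven't looked at how the relcache actually
lays out its entries):

#include <stdint.h>
#include <sys/types.h>

struct relcache_key {
	uint32_t	sum;	/* 32-bit checksum of file contents */
	off_t		size;	/* st_size from stat(2) */
};

so that a false match requires both a checksum collision and an
identical file size.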

> Note that the modification time of the file can also be used to
> further reduce the number of random collisions,

...at the price of ballooning the cache directory with new hashes every
time you rebuild the world.

> Also note that the commonly-used "CRC-64", originally from Matt
> Dillon's "diabolo" code, I believe, uses a rather poorly chosen
> polynomial (to begin with, it's really only 56 bits -- one end is all
> zeroes!) and is not a great choice.

CRC polynomials have a "hidden" 1 at one end (which end depends on how
you represent them); provided the zeroes are at that end, they don't
necessarily mean anything.  Consider, for example, 0xedb88320, the
usual reflected form of the standard 32-bit polynomial - it has five
zeroes at that end, but that doesn't mean you "really" have a CRC of
only 27 bits.
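
To make that "hidden" 1 concrete, here's the usual bit-at-a-time
CRC-32 loop over that reflected constant (again, just an illustrative
sketch, not anything taken from the relcache code):

#include <stddef.h>
#include <stdint.h>

uint32_t
crc32(const unsigned char *buf, size_t len)
{
	uint32_t crc = 0xffffffff;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= buf[i];
		for (bit = 0; bit < 8; bit++) {
			if (crc & 1) {
				/*
				 * Division step: the implicit x^32 term
				 * cancels the bit the shift discards, so
				 * only the other 32 coefficients show up
				 * in the constant.
				 */
				crc = (crc >> 1) ^ 0xedb88320;
			} else
				crc >>= 1;
		}
	}
	return crc ^ 0xffffffff;
}

The five low zero bits of 0xedb88320 sit right next to that implicit
term, so they cost nothing.  What really does cost you is missing
low-order terms - a polynomial divisible by x^k - because then the low
k bits of the remainder carry no information.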

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B