Subject: Re: RelCache (aka ELF prebinding) news
To: None <tech-kern@netbsd.org, tech-userlevel@netbsd.org>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: tech-kern
Date: 12/03/2002 10:32:31
On Tue, Dec 03, 2002 at 09:09:44AM +0100, der Mouse wrote:
> > I'm curious how you decided that the likelihood of a random collision
> > was too great with a 32-bit sum.
> 
> I didn't make that decision, but as someone who thinks it's so, perhaps
> I can explain.
> 
> Even under the most optimistic assumptions - that the algorithm
> produces a unformly distributed random numbers in the statistical
> sense, when run on the files in question - 32 bits just isn't enough.
> 
> It's the birthday paradox.  How many people do you need in a room
> before it becomes a decent bet that two have the same birthday?
> Roughly sqrt(365).  How many random numbers do you have to generate
> form a selection of N possibilities before it becomes a decent bet that
> you've got the same number twice?  Roughly sqrt(N).  (Working out
> exactly when this chance reaches 1/2 is a mess.  Even if you're willing
> to use Stirling's approximation to N!, it's still pretty ugly.  Loosely
> put, though, the chance of a duplication is 1/N on the second item, 2/N
> on the third, etc, to M-1/N on the last, for a total of M(M-1)/2N -
> which gets up around 1/2 when M gets up near sqrt(N).  The precise
> chance that they're all different is N!/(M! N^M), I think, for M<N.)
> 
> With 32-bit numbers, N is 2^32, which makes sqrt(N) 2^16.  It's just
> not that implausible that I've approached 64K distinct executables.

I think it's unlikely that you've got 64K distinct .so files that you've
ever mmaped into an executable.  I personally have 500, and I rebuild few
of them ever, except for the two dozen or so that are in /usr/lib.

However, we probably care about the question "how often does _anyone
anywhere_..." and that's much worse.  On the plus size, we have all the
file metadata to use to eliminate collisions: file name, modification
time, file size.  Where my mathematical abilities fall short is in
determining how many collisions this is likely to actually remove, but
I'm pretty sure it's the overwhelming majority of them.  If you added
the inode number to the mix...

I don't think it's harmful to do that, actually, because in the great
majority of cases if you rebuild your libraries they will all change
(I saw no suggestion that cache info for libraries that have gone MIA
would be retained, and nobody else's system does this AFAICT; I just
looked at several).

I walked myself through the birthday-paradox reasoning last night, then
took some empirical samples to determine how often the other metrics
for a given pair of object files would be the same.  You can easily do
the same thing yourself; it is extremely uncommon, as I think you will
discover; if you add inode number to the mix -- if someone moves all of
his system's shared libraries, regenerating the relcache is probably not
unreasonable -- it's way beyond that.

However, with no way to characterize how often CRC will actually collide
under those constraints, I'm still left doing a lot of guessing.  From a
pragmatic point of view, it seems reasonable to use both a 32-bit CRC
and a 32-bit sum computed using a completely different method, plus the
metadata.  We can't _really_ say how often there will be collisions, but
I'd bet you an awful lot of money that you won't see one this decade.

-- 
 Thor Lancelot Simon	                                      tls@rek.tjls.com
   But as he knew no bad language, he had called him all the names of common
 objects that he could think of, and had screamed: "You lamp!  You towel!  You
 plate!" and so on.              --Sigmund Freud