Subject: Re: RelCache (aka ELF prebinding) news
To: der Mouse <mouse@Rodents.Montreal.QC.CA>
From: Andrew Brown <atatat@atatdot.net>
List: tech-userlevel
Date: 12/03/2002 10:43:44
>> I'm curious how you decided that the likelihood of a random collision
>> was too great with a 32-bit sum.
>
>I didn't make that decision, but as someone who thinks it's so, perhaps
>I can explain.
>
>Even under the most optimistic assumptions - that the algorithm
>produces a unformly distributed random numbers in the statistical
>sense, when run on the files in question - 32 bits just isn't enough.
>
>It's the birthday paradox.  How many people do you need in a room
>before it becomes a decent bet that two have the same birthday?
>Roughly sqrt(365).  How many random numbers do you have to generate
>form a selection of N possibilities before it becomes a decent bet that
>you've got the same number twice?  Roughly sqrt(N).  (Working out
>exactly when this chance reaches 1/2 is a mess.  Even if you're willing
>to use Stirling's approximation to N!, it's still pretty ugly.  Loosely
>put, though, the chance of a duplication is 1/N on the second item, 2/N
>on the third, etc, to M-1/N on the last, for a total of M(M-1)/2N -
>which gets up around 1/2 when M gets up near sqrt(N).  The precise
>chance that they're all different is N!/(M! N^M), I think, for M<N.)
>
>With 32-bit numbers, N is 2^32, which makes sqrt(N) 2^16.  It's just
>not that implausible that I've approached 64K distinct executables.

for actual birthdays, you need 23 people to get a greater than 50%
chance of a collision, not sqrt(365), which is 19 (though i don't
think i'd but until you reached 28 people).  19:23 == 0.826.

for a prefectly random 32 bit number[1], you need 77164 samples to get
greater than 50% chance of a collision.  that's a tad more than 2^16,
though not quite 2^17.  the checkflist code in distrib/sets/checkflist
generates a list of 15214 things, a lot of which are directories and
most of which are not executable.  65536:77164 == 0.849.

i'm sure there's a formula somewhere for calculating the exact number
of elements in the subset you need for a birthday collision given the
size of the set, but i don't know what it is.

[1] a 32 bit crc is most decidedly *not* a random 32 bit number.  it's
quite deterministic, and the *function* of a crc is not to give a
token that is likely unique (ie, long-lived), but one that will
cheaply indicate whether the data it describes has somehow been munged
(ie, an expected lifetime of "one").

-- 
|-----< "CODE WARRIOR" >-----|
codewarrior@daemon.org             * "ah!  i see you have the internet
twofsonet@graffiti.com (Andrew Brown)                that goes *ping*!"
werdna@squooshy.com       * "information is power -- share the wealth."