On Tue, Dec 03, 2002 at 09:09:44AM +0100, der Mouse wrote:
> > I'm curious how you decided that the likelihood of a random collision
> > was too great with a 32-bit sum.
>
> I didn't make that decision, but as someone who thinks it's so, perhaps
> I can explain.
>
> Even under the most optimistic assumptions - that the algorithm
> produces uniformly distributed random numbers in the statistical
> sense, when run on the files in question - 32 bits just isn't enough.
>
> It's the birthday paradox.  How many people do you need in a room
> before it becomes a decent bet that two have the same birthday?
> Roughly sqrt(365).  How many random numbers do you have to generate
> from a selection of N possibilities before it becomes a decent bet that
> you've got the same number twice?  Roughly sqrt(N).  (Working out
> exactly when this chance reaches 1/2 is a mess.  Even if you're willing
> to use Stirling's approximation to N!, it's still pretty ugly.  Loosely
> put, though, the chance of a duplication is 1/N on the second item, 2/N
> on the third, etc, up to (M-1)/N on the last, for a total of M(M-1)/2N -
> which gets up around 1/2 when M gets up near sqrt(N).  The precise
> chance that they're all different is N!/((N-M)! N^M), I think, for M<N.)
>
> With 32-bit numbers, N is 2^32, which makes sqrt(N) 2^16.  It's just
> not that implausible that I've approached 64K distinct executables.

I think it's unlikely that you've got 64K distinct .so files that you've
ever mmapped into an executable.  I personally have about 500, and I
rarely rebuild any of them, except for the two dozen or so that are in
/usr/lib.  However, the question we probably care about is "how often
does _anyone anywhere_...", and that's much worse.  On the plus side, we
have all the file metadata to use to eliminate collisions: file name,
modification time, file size.
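The back-of-the-envelope figures quoted above are easy to check exactly. This sketch (mine, not from the thread; the function name is invented) computes the exact birthday-collision probability in log space, so it stays accurate even for N = 2^32:

```python
import math

def collision_probability(m: int, n: int) -> float:
    """Probability that m uniformly random values drawn from n
    possibilities contain at least one duplicate.

    Exact form: 1 - prod_{k=0}^{m-1} (n - k)/n, computed as a sum of
    logs (via log1p) to avoid underflow and overflow for large n.
    """
    if m > n:
        return 1.0  # pigeonhole: a duplicate is certain
    log_p_all_distinct = sum(math.log1p(-k / n) for k in range(m))
    return 1.0 - math.exp(log_p_all_distinct)

n = 2 ** 32  # the space of 32-bit checksums
for m in (1000, 65536, 200000):
    print(m, collision_probability(m, n))
```

At M = 2^16 = 65536 this comes out near 0.39, matching the loose M(M-1)/2N estimate of about 1/2 (the exact figure is 1 - e^(-1/2) in the limit), so sqrt(N) really is where the bet becomes respectable.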
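As a concrete sketch of keying on that metadata plus two independently computed 32-bit sums (here CRC-32 and Adler-32 from zlib as stand-ins; the actual relcache format and field names are my invention, not anything specified in this thread):

```python
import os
import zlib
from typing import NamedTuple

class CacheKey(NamedTuple):
    # Field names are illustrative only, not an existing relcache format.
    name: str      # file name
    mtime: int     # modification time
    size: int      # file size in bytes
    crc32: int     # 32-bit CRC of the contents
    adler32: int   # a second 32-bit sum, computed by a different method

def cache_key(path: str) -> CacheKey:
    """Build a cache key for a shared object from its metadata plus two
    different 32-bit checksums of its contents.  A false match requires
    every field to collide at once."""
    st = os.stat(path)
    with open(path, "rb") as f:
        data = f.read()
    return CacheKey(
        name=os.path.basename(path),
        mtime=int(st.st_mtime),
        size=st.st_size,
        crc32=zlib.crc32(data) & 0xFFFFFFFF,
        adler32=zlib.adler32(data) & 0xFFFFFFFF,
    )
```

Because the two sums are computed by unrelated methods, a pair of files that collides in one has no particular reason to collide in the other, and the name/mtime/size fields must also agree before the checksums are even consulted.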
Where my mathematical abilities fall short is in determining how many
collisions this metadata is likely to actually remove, but I'm pretty
sure it's the overwhelming majority of them.  And I don't think it's
harmful to add the inode number to the mix, because in the great
majority of cases, if you rebuild your libraries they will all get new
inodes.  (I saw no suggestion that cache info for libraries that have
gone MIA would be retained, and nobody else's system does this, AFAICT;
I just looked at several.)

I walked myself through the birthday-paradox reasoning last night, then
took some empirical samples to determine how often the other metadata
for a given pair of object files would be the same.  You can easily do
the same thing yourself; it is extremely uncommon, as I think you will
discover.  If you add the inode number to the mix -- and if someone
moves all of his system's shared libraries, regenerating the relcache
is probably not unreasonable -- it's rarer still.

However, with no way to characterize how often a CRC will actually
collide under those constraints, I'm still left doing a lot of guessing.
From a pragmatic point of view, it seems reasonable to use both a 32-bit
CRC and a 32-bit sum computed by a completely different method, plus the
metadata.  We can't _really_ say how often there will be collisions, but
I'd bet you an awful lot of money that you won't see one this decade.

-- 
 Thor Lancelot Simon                                      tls@rek.tjls.com
  But as he knew no bad language, he had called him all the names of
  common objects that he could think of, and had screamed: "You lamp!
  You towel!  You plate!" and so on.  --Sigmund Freud