Subject: Re: RelCache (aka ELF prebinding) news
To: Bang Jun-Young <junyoung@netbsd.org>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: tech-kern
Date: 12/03/2002 02:27:02
On Tue, Dec 03, 2002 at 11:29:19AM +0900, Bang Jun-Young wrote:
> On Mon, Dec 02, 2002 at 06:50:58AM -0500, Ty Sarna wrote:
> > 
> > For example, I just compiled bin/cat from the 1.6 sources. I download
> > bin/cat from ftp.netbsd.org and it is identical to mine. There are a
> > zillion small ways in which this is useful. Don't screw it up.
> 
> (You do have a point ;-)
> 
> I agree. Timestamp in ELF binary doesn't seem to be a good idea, either.
> 
> Now I'm wondering: is there any 64 bit long checksum available? Or is
> it safe to two different 32 bit checksum methods for a single file?

I'm curious how you decided that the likelihood of a random collision was
too great with a 32-bit sum.  I started to make some rather simplistic
mathematical arguments here but after investigating the literature on
CRC error rates for about an hour, all I'm really sure of is that my
calculations were wrong.

However, there are significantly better 32-bit CRC polynomials than the
usual one; the one used by iSCSI is particularly good, against the kind
of errors seen in data communications links at least.  Whether it is
better for this purpose, I couldn't really say, because I cannot answer
even the weak question "given two random bitstrings of length N, what is
the probability that they will have the same 32-bit CRC result with
polynomial M", much less "given two ELF object files of length N, what is
the probability that they will have the same 32-bit CRC result with
polynomial M", which is what we care about (since the relcache can
obviously use the file size to eliminate all other collisions).

Note that the modification time of the file can also be used to further
reduce the number of random collisions, quite possibly close enough to
zero that nobody will ever see one -- but I can't _prove_ this, or even
reason about it with any real basis except for a Very Strong Hunch.  
Anyone feeling mathematical?

I really think that a 32-bit sum suffices unless you really believe that
the inputs are likely to be such that CRC-32 will produce poor results.  But
if you really want more bits, you could use both CRC-32 and another 32-bit
(or shorter) sum such as an Adler-32 or Fletcher-32 sum, or just use a
longer sum that is faster to compute than CRC, such as a 40, 48, or 64-bit
Fletcher sum.

Note that the Fletcher-32 sum should be significantly faster to compute
than the Adler-32 sum, and should serve just fine for this purpose, so if
you decide to use CRC32 and something else, it would be a good choice.

For some good references on checksums and CRCs, see:

http://www.haifa.il.ibm.com/satran/ips/draft-sheinwald-iscsi-crc-01.txt
http://www.diskdrive.com/reading-room/white-papers/Definitive_CRC32C_Paper.pdf

http://www.ece.cmu.edu/~ips/archive/draft-cavanna-iscsi-crc-vs-cksum-01.txt

Note that the last of those three references contains some mistakes in its
reasoning about the probability of various kinds of checksum failures, but
is still a good general overview of the different kinds of sums available,
and is less mathematically daunting than the first two.

Also note that the commonly-used "CRC-64", originally from Matt Dillon's
"diabolo" code, I believe, uses a rather poorly chosen polynomial (to begin
with, it's really only 56 bits -- one end is all zeroes!) and is not a great
choice.

-- 
 Thor Lancelot Simon	                                      tls@rek.tjls.com
   But as he knew no bad language, he had called him all the names of common
 objects that he could think of, and had screamed: "You lamp!  You towel!  You
 plate!" and so on.              --Sigmund Freud