Subject: Re: RelCache (aka ELF prebinding) news
To: None <tls@rek.tjls.com>
From: Jonathan Stone <jonathan@DSG.Stanford.EDU>
List: tech-userlevel
Date: 12/03/2002 14:39:10
In message <20021203072702.GA22049@rek.tjls.com>Thor Lancelot Simon writes
>On Tue, Dec 03, 2002 at 11:29:19AM +0900, Bang Jun-Young wrote:

[...]

>I'm curious how you decided that the likelihood of a random collision was
>too great with a 32-bit sum.  I started to make some rather simplistic
>mathematical arguments here but after investigating the literature on
>CRC error rates for about an hour, all I'm really sure of is that my
>calculations were wrong.
>
>However, there are significantly better 32-bit CRC polynomials than the
>usual one; 

Nope. Really, nope. This is a common fallacy[*]. But unless  you
specify some particular error model against which you wish to
protect[*], there are *no*, repeat *no*, grounds for claiming one
32-bit CRC (or other error check) is better than any other.

In the absenc of any description of how the data is distributed, and
of the kinds of ``errors'' (CRC collisions, in this arena), then the
only metric available for comparison is a combinatoric metric.  And
very simple combinatorics show that any function (in the mathematical
sense: a function is a relation R(x, y) where for all x, y1, y2, (R(x,
y1) and R(x, y2)) implies y1=y2).  is as good as any other function.
Chapter 3 of my dissertation gives a formal exposition.

[*]: I'm doing TLS an injustice here, because he did go on to refer
to some particular error model.



>the one used by iSCSI is particularly good, against the kind
>of errors seen in data communications links at least.  

>Whether it is
>better for this purpose, I couldn't really say, because I cannot answer
>even the weak question "given two random bitstrings of length N, what is
>the probability that they will have the same 32-bit CRC result with
>polynomial M", much less "given two ELF object files of length N, what is
>the probability that they will have the same 32-bit CRC result with
>polynomial M", which is what we care about (since the relcache can
>obviously use the file size to eliminate all other collisions).

If N is much bigger than the order of the generating polynomial of M,
then it doesn't matter.  The method used by the EE communications
researchers -- Castagnoli [1], and Wolf's [2,3]group at San Diego(?)
-- is (via some heavy mathematical tools) to figure out the very
lowest Hamming-distance error which a given CRC generator-polynomial
fails to detect. To put it more concretely: look at all possible XOR
error strings, of length up to 2^N for an N-bit CRC, and for a given
generator, find the XOR error-syndrome with the lowest number of one
bits which the given generator fails to detect.  This approach is
based on an implicit assumption that errors are single-bit or short
bursts, with a distribution close to expnentional (or Bernoulli, if
you want to get picky).


This metric for CRCs is great for link-level errors, but isn't really
relevant to using CRCs as a placebo for a one-way hash: the ``error
models'' are not commensurate.  If you really want a CRC (rather than,
say, md4?) then generators with a low number of 1 bits allow for fast,
efficient software implementations [4].  But CRCs are easily
invertible: I can smash any particular bits in the message and (if the
CRC is known) recompute a valid CRC.  Then one should ask: why choose
a CRC over some other invertible checksum, say a 32-bit
one's-complement sum, a 32-bit Fletcher sum, or Adler-32?

I've computed Sum p(i) log_P(i) (Shannon entropy) of various 32-bit
checksums over large filesystems (i.e., i ranges over the checksum
values of all file in the filesystem). If pressed I can dig up the
data.  If the application doesn't require non-invertibility, and the
input is file data, then a 32-bit Fletcher sum (two 16-bit sums) is
_probably_ a good choice. Looking at the sizes of my /usr/bin,
Adler-32 is _strongly_ contra-indicated by the significant number of
small files (as discussed in chapter 6 of my dissertation, and
summarised in RFC3309.)  For the specific case of ELF, a 32-bit one's
complement sum is just as good.  (For arbitrary files it is a poor
chioce; the intutition is that you can't tell 0xFFFFFFFF from
0x00000000, and that crops up more than was expected. See our
SIGCOMM 95 paper on checksums, or the longer ToN paper, for measurements
against files in real filesystems.)


... and apologies for distorting the signal-to-noise ratio.

[1] G. Castagnoli, S. Braeuer and M. Herrman, "Optimization of
Cyclic Redundancy-Check Codes with  24 and 32 Parity Bits",
IEEE Transactions on  Communications, Vol. 41, No. 6, June 1993

[2] K. Wolf and R. D. Blackeney, An exact evaluatoin of the
probability of Undetected Error for certaion shortened binary CRC
codes, MILCOMM Proceedings, 1988

[3] J.K Wolf and Dexter Chun, The single burst error detection performance of binary cyclic codes, IEEE Transactions on Communications, 
  volume 42, number 1, pp. 11-13,   January 1994.

[4] D. Feldmeier, Fast software implementation of error detection codes,
IEE Transactions on Networking, volume 3, number 6, pp. 640--651,
December, 1995.