Subject: tuning IP checksumming code...
To: Charles M. Hannum <mycroft@mit.edu>
From: Jonathan Stone <jonathan@dsg.stanford.edu>
List: port-i386
Date: 07/17/1996 18:09:04
hi,

I've just looked at the source and the CVS logs for in_cksum.s.

Experimentally, using the 1.2 in_cksum.c or the tuned in_cksum.s seems
to make no significant performance difference on the P120s here; the
tuned code may be marginally slower.  Not having seen the
alternatives, I'm not sure which flavour of x86 in_cksum.s is tuned
for.

One option (perhaps best left unmentioned) is to link in several
versions of the in_cksum code tuned for different CPUs (386, 486,
Pentium, ... as in_cksum_386, in_cksum_486, or whatever); to put some
version in as the "default", and early in boot time, to identify the
CPU type and patch in the most effective copy for the machine on which
the kernel is actually running.  Function pointers are the obvious
alternative, but they add yet another cache miss to every invocation of
in_cksum().
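
For concreteness, here is a rough sketch of the function-pointer
alternative.  Every name in it (cpu_class, cksum_select, in_cksum_ptr,
cksum_386 and friends) is made up for illustration and none of it is taken
from the NetBSD tree; the per-CPU routines are placeholders that just call
a portable byte-buffer checksum, where a real kernel would have the tuned
assembler versions operating on mbuf chains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU classes; a real kernel would get this from its
 * own CPU identification code early in boot. */
enum cpu_class { CPU_386, CPU_486, CPU_PENTIUM, CPU_NCLASSES };

typedef uint16_t (*cksum_fn)(const uint8_t *buf, size_t len);

/* Portable reference: one's-complement sum of big-endian 16-bit words,
 * folded to 16 bits and complemented. */
uint16_t cksum_generic(const uint8_t *buf, size_t len)
{
    uint64_t sum = 0;

    while (len > 1) {
        sum += ((uint32_t)buf[0] << 8) | buf[1];
        buf += 2;
        len -= 2;
    }
    if (len == 1)                       /* odd trailing byte, zero padded */
        sum += (uint32_t)buf[0] << 8;
    while (sum >> 16)                   /* fold carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Placeholders standing in for the per-CPU tuned routines. */
uint16_t cksum_386(const uint8_t *b, size_t n)     { return cksum_generic(b, n); }
uint16_t cksum_486(const uint8_t *b, size_t n)     { return cksum_generic(b, n); }
uint16_t cksum_pentium(const uint8_t *b, size_t n) { return cksum_generic(b, n); }

static const cksum_fn cksum_table[CPU_NCLASSES] = {
    cksum_386, cksum_486, cksum_pentium
};

/* The "default" version, used until the CPU has been identified. */
cksum_fn in_cksum_ptr = cksum_generic;

/* Called once, early in boot, after CPU identification. */
void cksum_select(enum cpu_class cls)
{
    assert((int)cls >= 0 && (int)cls < CPU_NCLASSES);
    in_cksum_ptr = cksum_table[cls];
}
```

Callers would then go through in_cksum_ptr(buf, len), which is exactly
where the extra load (and possible cache miss) comes from; patching the
call site at boot avoids the indirection at the price of self-modifying
kernel text.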

As a minor question, the literature in the field suggests that,
at least on Mips and Alpha chips, the best size to unroll in_cksum
is not 128 or some close power of 2, but rather the number of bytes of
data in an mbuf (MLEN or MHLEN), since the two are usually relatively
prime.

Is there something about 4.4bsd, NetBSD, or x86 pipelines that
invalidates this conventional wisdom?  (IIRC it comes from Kay and
Pasquale, whose work dates back some years and was done on a 4.2bsd-ish
system without pkthdrs in mbufs.  They advised unrolling loops to MLEN
bytes' worth.)
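
To make the unrolling question concrete, here is a small C sketch of a
checksum loop unrolled to one mbuf's worth of data.  It is only an
illustration, not the code in in_cksum.s: the 108 is an assumed stand-in
for MLEN (check mbuf.h for the real value on your system), and the
function and macro names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNROLL_BYTES 108        /* assumed stand-in for MLEN; 54 words */

/* Add the i'th big-endian 16-bit word at p into sum. */
#define W(i)   sum += ((uint32_t)p[2 * (i)] << 8) | p[2 * (i) + 1];
#define W2(i)  W(i)   W((i) + 1)
#define W6(i)  W2(i)  W2((i) + 2)   W2((i) + 4)
#define W18(i) W6(i)  W6((i) + 6)   W6((i) + 12)
#define W54(i) W18(i) W18((i) + 18) W18((i) + 36)

/* Straightforward word-at-a-time version, kept for comparison. */
uint16_t cksum_simple(const uint8_t *p, size_t len)
{
    uint64_t sum = 0;

    while (len > 1) {
        W(0)
        p += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)p[0] << 8;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Same checksum, with the main loop unrolled to UNROLL_BYTES per pass. */
uint16_t cksum_unrolled(const uint8_t *p, size_t len)
{
    uint64_t sum = 0;

    assert(p != NULL || len == 0);
    while (len >= UNROLL_BYTES) {       /* one full mbuf's worth, no branches */
        W54(0)
        p += UNROLL_BYTES;
        len -= UNROLL_BYTES;
    }
    while (len > 1) {                   /* leftover words */
        W(0)
        p += 2;
        len -= 2;
    }
    if (len == 1)                       /* odd trailing byte, zero padded */
        sum += (uint32_t)p[0] << 8;
    while (sum >> 16)                   /* fold carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

With this shape, a full small mbuf is consumed in exactly one pass of the
unrolled body with no cleanup loop, whereas a 128-byte unroll would never
run its main loop on such a buffer.  Whether that still pays off on x86
is the open question.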

Perhaps the dynamic occurrence of mbuf chains relative to mbuf clusters has
changed since then?


NB: for those who care, the latest Linux 2.0.x has TCP-over-loopback
throughput that's about 25% higher than NetBSD's on a P120.  (That's
with a ~3.5k MTU on the Linux lo0, and 32k on NetBSD.)  Finding 10%
improvements here and there on NetBSD/i386 seems really quite attractive.