Subject: Re: tuning IP checksumming code...
To: Jonathan Stone <jonathan@dsg.stanford.edu>
From: Charles M. Hannum <mycroft@mit.edu>
List: tech-net
Date: 07/17/1996 21:59:40

Jonathan Stone <jonathan@DSG.Stanford.EDU> writes:

> Experimentally, using the 1.2 in_cksum.c or the tuned in_cksum.s seems
> to make no significant performance difference on the P120s here; the
> tuned code may be marginally slower.

I suspect that your test harness is not very good, then.  I test with
a wide range of mbuf alignments and sizes, and the new version is
faster in almost every case.  (There are a few degenerate cases where
it's slightly slower, but these can't occur in practice.)
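
For illustration only, here is a minimal sketch of the kind of sweep
described above.  It is not the harness actually used; it works on flat
buffers rather than mbuf chains, it checks agreement rather than
timing, and every name in it (cksum_bytewise, cksum_pairs,
sweep_mismatches) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t (*cksum_fn)(const uint8_t *, size_t);

/* Reference: one byte at a time, network (big-endian) word order. */
uint16_t cksum_bytewise(const uint8_t *p, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (i & 1) ? (uint32_t)p[i] : (uint32_t)p[i] << 8;
    while (sum >> 16)                      /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Candidate: one 16-bit word at a time, with an odd trailing byte. */
uint16_t cksum_pairs(const uint8_t *p, size_t n)
{
    uint64_t sum = 0;
    for (; n >= 2; p += 2, n -= 2)
        sum += ((uint32_t)p[0] << 8) | p[1];
    if (n == 1)
        sum += (uint32_t)p[0] << 8;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Compare two implementations at every start offset 0..7 and every
 * length 0..256; return the number of disagreements. */
size_t sweep_mismatches(cksum_fn ref, cksum_fn cand)
{
    enum { MAX_OFF = 8, MAX_LEN = 256 };
    static uint8_t buf[MAX_OFF + MAX_LEN];
    size_t bad = 0;

    for (size_t i = 0; i < sizeof buf; i++)
        buf[i] = (uint8_t)(i * 37 + 11);   /* arbitrary test pattern */

    for (size_t off = 0; off < MAX_OFF; off++)
        for (size_t len = 0; len <= MAX_LEN; len++)
            if (ref(buf + off, len) != cand(buf + off, len))
                bad++;
    return bad;
}
```

A timing harness would wrap the same two loops around a cycle counter
and report per-alignment, per-size results, so that a win or loss at
one particular alignment is not averaged away.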

> As a minor question, the literature in the field suggests that,
> at least on MIPS and Alpha chips, the best size to unroll in_cksum
> is not 128 or some close power of 2, but rather the number of bytes of
> data in an mbuf (MLEN or MHLEN), since the two are usually relatively
> prime.
> 
> Is there something about 4.4bsd, NetBSD, or x86 pipelines that
> invalidates this conventional wisdom?  (IIRC, that is Kay and
> Pasquale, which dates back some years and was done on a 4.2bsd-ish
> system without pkthdrs in mbufs.  They advised unrolling loops to
> MLEN bytes' worth.)

There's something about modern *caches* that invalidates that
`wisdom'.  The large loops are highly optimized for 486 and Pentium
cache loading behaviour, so that you only get one stall per cache line
(unlike, for example, the OpenBSD version which stalls at least twice
per cache line).
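
To make the loop shape concrete, here is a rough C sketch of a checksum
loop unrolled to one cache line per iteration.  It is not the tuned
in_cksum.s; C cannot express the load scheduling that decides how many
stalls each line costs, so this shows only the structure.  LINE_BYTES
is an assumed value (32 bytes, the Pentium line size; the 486 uses 16),
and the names cksum_unrolled and ADD2 are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 32u  /* assumed cache line size; illustrative only */

/* Add the big-endian 16-bit word at p[i], p[i+1] to the running sum. */
#define ADD2(i) (sum += ((uint32_t)p[(i)] << 8) | p[(i) + 1])

uint16_t cksum_unrolled(const uint8_t *p, size_t n)
{
    uint64_t sum = 0;

    /* Main loop: each iteration consumes exactly one line's worth of
     * data, so the loop body maps onto one cache line fill when the
     * buffer is line-aligned. */
    while (n >= LINE_BYTES) {
        ADD2(0);  ADD2(2);  ADD2(4);  ADD2(6);
        ADD2(8);  ADD2(10); ADD2(12); ADD2(14);
        ADD2(16); ADD2(18); ADD2(20); ADD2(22);
        ADD2(24); ADD2(26); ADD2(28); ADD2(30);
        p += LINE_BYTES;
        n -= LINE_BYTES;
    }

    /* Tail: remaining whole words, then an odd trailing byte. */
    while (n >= 2) {
        ADD2(0);
        p += 2;
        n -= 2;
    }
    if (n == 1)
        sum += (uint32_t)p[0] << 8;

    /* Fold the carries back into 16 bits and complement. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

The point of sizing the unrolled body to the cache line rather than to
MLEN is that the line fill, not the loop overhead, is the cost being
scheduled around.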