Subject: Re: NetBSD/i386 processor recommendation
To: Ross Harvey <ross@teraflop.com>
From: Jonathan Stone <jonathan@DSG.Stanford.EDU>
List: port-i386
Date: 08/07/1997 16:49:03
 >You certainly achieved a giant P5 speedup on bcopies up to 10K,
 >however, the effect dies out abruptly right there. The P6 issue is with
 >purely dram writes, whereas your speedup affects only in-cache non-dram
 >bcopies. (Although there is a spillover effect if you leave some part
 >of the bcopy in the cache and never measure the time it takes to flush
 >it out to dram.)

The bcopy() is really Kevin's idea, after observing the impact of the
no-allocate-on-write cache policy.  It's not really mine at all, tho'
I did polish it from an aligned-block-copy to a full bcopy().

Please be careful with the attributions!
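
For anyone who hasn't seen the trick, here is a minimal sketch of the
idea as I understand it, not Kevin's code or anything from a tree. On a
cache that does not allocate a line on a write miss, stores to an
uncached destination go straight out to memory; reading the destination
first pulls the line into the cache so the stores hit. The name
prefetch_copy, the 32-byte line size, and the non-overlap and
line-aligned-destination restrictions are all assumptions made for
illustration.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_LINE 32	/* assumed P5 L1 data-cache line size, in bytes */

/*
 * Hypothetical helper, not the NetBSD or FreeBSD routine.
 * Copies len bytes from src to dst (bcopy argument order).
 * The regions must not overlap.  For the touch to cover each
 * whole chunk, dst is assumed to be cache-line aligned.
 */
static void
prefetch_copy(const void *src, void *dst, size_t len)
{
	const unsigned char *s = src;
	unsigned char *d = dst;

	while (len > 0) {
		size_t n = len < CACHE_LINE ? len : CACHE_LINE;

		/* Read one destination byte so the line gets allocated... */
		(void)*(volatile unsigned char *)d;

		/* ...so that these stores can hit in the cache. */
		memcpy(d, s, n);

		s += n;
		d += n;
		len -= n;
	}
}
```

A full bcopy() additionally has to cope with unaligned heads and tails
and with overlapping regions, which this sketch ignores.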


 >Anyway, I like your bcopy a lot, it gets a huge speedup on copies up
 >to about 10K, (and on MMX, even better), and a small speedup for some
 >additional length. It is not needed and won't help on the P6, which
 >already gets great in-cache numbers, but it should do little or no
 >harm, either.


The FreeBSD code does (apparently) better yet by using FP registers to
do 64-bit moves rather than 32-bit moves.  It shouldn't hurt on the
P6; I don't know about 486 or 386 systems. Writing through a
board-level cache probably helps, if the cache can do smarter
writeback of entire lines; but 486es before the DX3 and (sic) DX4 used
write-through, not write-back.
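
To make the 64-bit versus 32-bit point concrete, here is a rough sketch
in portable C. It is not the FreeBSD code: C gives no way to force the
use of FP registers, so this only shows the wider move unit, and the
compiler decides what instructions actually get emitted. The name
wide_copy is made up, and non-overlapping regions are assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical illustration: move 8 bytes per iteration instead of 4.
 * Copies len bytes from src to dst (bcopy argument order); the
 * regions must not overlap.
 */
static void
wide_copy(const void *src, void *dst, size_t len)
{
	const unsigned char *s = src;
	unsigned char *d = dst;

	/* Bulk of the copy in 64-bit units. */
	while (len >= sizeof(uint64_t)) {
		uint64_t w;

		memcpy(&w, s, sizeof w);	/* one 64-bit load */
		memcpy(d, &w, sizeof w);	/* one 64-bit store */
		s += sizeof w;
		d += sizeof w;
		len -= sizeof w;
	}

	/* Leftover tail, one byte at a time. */
	while (len > 0) {
		*d++ = *s++;
		len--;
	}
}
```

Going through memcpy() for each word keeps the sketch free of alignment
and aliasing problems; a hand-written assembler routine would deal with
alignment explicitly.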

 >
 >This should definitely be in NetBSD for all those P5 users.

I suggested copying the jump-vector implementation FreeBSD uses.
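
Roughly, a jump vector means bcopy() calls through a function pointer
that is set once, early on, according to the CPU that was detected. The
sketch below shows only the shape of that; every name in it (cpu_class,
bcopy_vector, bcopy_select, kbcopy, and the two routines) is
hypothetical rather than taken from the FreeBSD or NetBSD sources, and
the "P5" routine is a placeholder.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical CPU classes, as might be determined at boot. */
enum cpu_class { CPU_386, CPU_486, CPU_P5, CPU_P6 };

typedef void (*bcopy_fn)(const void *src, void *dst, size_t len);

/* Default routine, safe on every CPU. */
static void
generic_bcopy(const void *src, void *dst, size_t len)
{
	memmove(dst, src, len);
}

/* Placeholder standing in for a P5-tuned routine. */
static void
p5_bcopy(const void *src, void *dst, size_t len)
{
	memmove(dst, src, len);
}

/* The jump vector: starts out pointing at the generic routine. */
static bcopy_fn bcopy_vector = generic_bcopy;

/* Called once after CPU identification. */
static void
bcopy_select(enum cpu_class c)
{
	bcopy_vector = (c == CPU_P5) ? p5_bcopy : generic_bcopy;
}

/* What callers use; one indirect call, no per-call CPU test. */
static void
kbcopy(const void *src, void *dst, size_t len)
{
	bcopy_vector(src, dst, len);
}
```

The attraction is that the P6, 486 and 386 keep whatever routine suits
them and pay only for the indirect call.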

 >For those of you who are interested, here are some very rough numbers:
 >(The variation in the numbers is mostly due to the difference between
 >L1 and L2 cache speeds, so you sort of can read it as L2-L1)

Uh, were these from the aligned-block-copy in Kevin's benchmark suite,
or from a full bcopy() like the FreeBSD code?