Subject: Re: NetBSD/i386 processor recommendation
To: Ross Harvey <ross@teraflop.com>
From: Jonathan Stone <jonathan@DSG.Stanford.EDU>
List: port-i386
Date: 08/07/1997 16:49:03
>You certainly achieved a giant P5 speedup on bcopies up to 10K,
>however, the effect dies out abruptly right there. The P6 issue is with
>purely dram writes, whereas your speedup affects only in-cache non-dram
>bcopies. (Although there is a spillover effect if you leave some part
>of the bcopy in the cache and never measure the time it takes to flush
>it out to dram.)
The bcopy() is really Kevin's idea, after observing the impact of the
no-allocate-on-write cache policy. It's not really mine at all, tho'
I did polish it from an aligned-block-copy to a full bcopy().
Please be careful with the attributions!
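
For readers who haven't seen the trick: with a no-allocate-on-write
cache, a store that misses goes straight out to memory and the line is
never brought in. One common workaround is to read each destination
line before writing it, so the stores hit in the cache. The following
is only a minimal C sketch of that idea, not Kevin's code or the
polished bcopy(); the name prefetch_copy and the 32-byte line size are
assumptions, and it handles non-overlapping buffers only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 32   /* assumed L1 line size in bytes */

/*
 * Hypothetical sketch: copy len bytes from src to dst (no overlap),
 * touching each full destination cache line with a read before
 * storing into it.
 */
void prefetch_copy(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t head;

    assert(dst != NULL && src != NULL);

    /* Copy leading bytes until d sits on a cache-line boundary. */
    head = (CACHE_LINE - ((uintptr_t)d % CACHE_LINE)) % CACHE_LINE;
    if (head > len)
        head = len;
    memcpy(d, s, head);
    d += head;
    s += head;
    len -= head;

    while (len >= CACHE_LINE) {
        /* Dummy read: asks the cache to load the destination line. */
        (void)*(volatile unsigned char *)d;
        memcpy(d, s, CACHE_LINE);
        d += CACHE_LINE;
        s += CACHE_LINE;
        len -= CACHE_LINE;
    }

    /* Trailing partial line, if any. */
    memcpy(d, s, len);
}
```

Whether the dummy read pays off depends on the CPU and on whether the
destination is already cached; the sketch shows the access pattern
only and makes no claim about timing.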
>Anyway, I like your bcopy a lot, it gets a huge speedup on copies up
>to about 10K, (and on MMX, even better), and a small speedup for some
>additional length. It is not needed and won't help on the P6, which
>already gets great in-cache numbers, but it should do little or no
>harm, either.
The FreeBSD code does (apparently) better yet by using FP registers to
do 64-bit moves rather than 32-bit moves. It shouldn't hurt on the
P6; I don't know about 486 or 386 systems. Writing through a
board-level cache probably helps, if the cache can do smarter
writeback of entire lines; but 486es before the DX3 and (sic) DX4 used
write-through, not write-back.
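
To illustrate the 64-bit versus 32-bit point, here is a rough C sketch
that moves eight bytes per step instead of four. It is not the FreeBSD
routine (which is assembly and goes through the FP registers); the
name copy64 is hypothetical, and the compiler decides which
instructions are actually emitted for each 8-byte move.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: copy len bytes from src to dst (no overlap)
 * in 64-bit units, finishing any remainder a byte at a time.
 */
void copy64(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    while (len >= sizeof(uint64_t)) {
        uint64_t w;

        memcpy(&w, s, sizeof w);    /* one 8-byte load  */
        memcpy(d, &w, sizeof w);    /* one 8-byte store */
        d += sizeof w;
        s += sizeof w;
        len -= sizeof w;
    }

    /* Remaining 0..7 bytes. */
    while (len > 0) {
        *d++ = *s++;
        len--;
    }
}
```

Going through memcpy() for each word keeps the sketch free of
alignment and aliasing problems; a hand-written assembly version
would issue the wide loads and stores directly.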
>
>This should definitely be in NetBSD for all those P5 users.
I suggested copying the jump-vector implementation FreeBSD uses.
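
By "jump vector" I mean calling bcopy() through a pointer that is set
once, at startup, to whichever routine suits the CPU. A minimal sketch
in C, assuming a boot-time CPU class is available; every name here
(cpu_class, copy_vector, generic_copy, p5_copy, vectored_bcopy) is
made up for illustration and is not FreeBSD's or NetBSD's actual
symbol.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical CPU classes, as a boot-time probe might report them. */
enum cpu_class { CPU_386, CPU_486, CPU_586, CPU_686 };

/* Same argument order as bcopy(): source first, then destination. */
typedef void (*copy_fn)(const void *src, void *dst, size_t len);

/* Plain copy, safe for overlapping buffers. */
void generic_copy(const void *src, void *dst, size_t len)
{
    memmove(dst, src, len);
}

/* Placeholder for a P5-tuned routine; here it just calls memmove(). */
void p5_copy(const void *src, void *dst, size_t len)
{
    memmove(dst, src, len);
}

/* The vector: defaults to the generic routine until initialised. */
copy_fn copy_vector = generic_copy;

/* Pick the routine once, based on the detected CPU class. */
void copy_vector_init(enum cpu_class c)
{
    copy_vector = (c == CPU_586) ? p5_copy : generic_copy;
}

/* Callers use this entry point and never see the selection. */
void vectored_bcopy(const void *src, void *dst, size_t len)
{
    copy_vector(src, dst, len);
}
```

The cost is one indirect call per copy; in exchange the P6, 486 and
386 keep the generic code and only the P5 gets the tuned routine.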
>For those of you who are interested, here are some very rough numbers:
>(The variation in the numbers is mostly due to the difference between
>L1 and L2 cache speeds, so you sort of can read it as L2-L1)
Uh, were these from the aligned-block-copy in Kevin's benchmark suite,
or from a full bcopy() like the FreeBSD code?