Subject: Re: NetBSD/i386 processor recommendation
To: None <gary@wheel.tiac.net, jonathan@dsg.stanford.edu>
From: Ross Harvey <ross@teraflop.com>
List: port-i386
Date: 08/07/1997 12:34:25
>>> The P6 does unnecessary read-for-ownership cycles even when not in SMP mode.
>>> It ruins the main memory write performance. However, the reads and in-cache
>>> writes are so much faster that this only affects things like long bcopys.
>>
>>>Shouldn't bcopy and friends use a read after every cache-line full of data
>>>written (considering alignment and all that fun)? IIRC that would cause a
>>>burst-read of the cache-line, a fill in the cache in exclusive mode and then
>>>a burst-write of the whole line back to main memory. Should be faster since
>>>P5...
>
[Jonathan Stone]
>Uh... I collaborated on Kevin Lai's USENIX paper which points out the
>advantages of a read-cache-line-before-write strategy on Pentium chips
>with a no-allocate-on-write-miss cache policy.
>
>I hacked up a bcopy() based on Kevin's and then ported the FreeBSD
>bcopy(), which does marginally (or even, if you prefer) better by
>using the FPU registers to issue 64-bit copies.
>
>I've sent copies of that to Frank van der Linden (Charles Hannum, the
>principal i386 portmaster, discards all e-mail from me, so I haven't
>sent him a copy).
>
>Caveat emptor.
You certainly achieved a giant P5 speedup on bcopies up to 10K;
however, the effect dies out abruptly right there. The P6 issue is
purely with DRAM writes, whereas your speedup affects only in-cache,
non-DRAM bcopies. (Although there is a spillover effect if you leave
some part of the bcopy in the cache and never measure the time it takes
to flush it out to DRAM.)
Anyway, I like your bcopy a lot. It gets a huge speedup on copies up
to about 10K (and on MMX, even better) and a small speedup for some
additional length. It is not needed and won't help on the P6, which
already gets great in-cache numbers, but it should do little or no
harm, either.
This should definitely be in NetBSD for all those P5 users.
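For readers who have not seen the technique, here is a minimal sketch of
the read-before-write idea discussed above. It is not Jonathan's or
Kevin's code. The function name copy_read_before_write is hypothetical,
the 32-byte line size is an assumption about the P5, and the inner copy
uses plain memcpy() rather than the FPU-register moves mentioned in the
quoted text.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed cache-line size in bytes (32 on the P5); not taken from the post. */
#define LINE_SIZE 32

/*
 * Hypothetical helper: copy n bytes from src to dst (non-overlapping),
 * reading one byte of each full destination line before writing it.
 */
void copy_read_before_write(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Copy leading bytes until d sits on a cache-line boundary. */
    size_t head = (LINE_SIZE - ((uintptr_t)d % LINE_SIZE)) % LINE_SIZE;
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head; s += head; n -= head;

    /* For each full line: touch the destination, then overwrite it. */
    while (n >= LINE_SIZE) {
        (void)*(volatile unsigned char *)d;  /* read pulls the line into cache */
        memcpy(d, s, LINE_SIZE);
        d += LINE_SIZE; s += LINE_SIZE; n -= LINE_SIZE;
    }

    /* Trailing partial line. */
    memcpy(d, s, n);
}
```

On a cache that does not allocate on a write miss, the extra read is what
brings the destination line into the cache, so the following stores hit
in cache instead of going to memory one at a time. On a write-allocate
cache the read should be redundant.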
For those of you who are interested, here are some very rough numbers:
(The variation in the numbers is mostly due to the difference between
L1 and L2 cache speeds, so you can roughly read each range as L2-L1.)
CACHE AND DRAM SPEEDS

            P5 cache    P6 cache    P5 DRAM    P6 DRAM
MB/s        75-200      400-775     50-75      70 (W)
                                               200 (R)
JONATHAN STONE'S FAST P5 BCOPY

                    1K     5K     50K    500K
Old bcopy (MB/s)    35     35     35     30
JS bcopy  (MB/s)    250    260    50     30
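
Rough MB/s figures like the ones above can be gathered with a simple
timing loop. The sketch below is an assumption about how one might do it,
not the harness used for these numbers. The name copy_rate_mb_per_s is
hypothetical, and it times the C library's memcpy() with clock().

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: copy `len` bytes `reps` times with memcpy and return
 * the observed rate in MB/s (10^6 bytes per second), or 0.0 if len is zero,
 * the buffers could not be allocated, or the run was too short for clock()
 * to resolve.
 */
double copy_rate_mb_per_s(size_t len, unsigned long reps)
{
    unsigned char *src, *dst;
    volatile unsigned char sink = 0;
    double secs, rate = 0.0;
    clock_t t0, t1;
    unsigned long i;

    if (len == 0)
        return 0.0;
    src = malloc(len);
    dst = malloc(len);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return 0.0;
    }
    memset(src, 1, len);
    memset(dst, 0, len);

    t0 = clock();
    for (i = 0; i < reps; i++) {
        memcpy(dst, src, len);
        sink = dst[i % len];  /* use the result so the copy is not elided */
    }
    t1 = clock();

    secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (secs > 0.0)
        rate = ((double)len * (double)reps) / secs / 1e6;

    (void)sink;
    free(src);
    free(dst);
    return rate;
}
```

Calling it with lengths of 1K, 5K, 50K and 500K gives a table of the same
shape as the one above. Because the same buffers are copied repeatedly,
small lengths mostly show cache speed, and the loop does not account for
the time needed to flush dirty lines out to DRAM.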
----------------------
Ross Harvey Avalon Computer Systems, Inc. ross@teraflop.com
Santa Barbara http://www.teraflop.com