Subject: Re: NetBSD/i386 processor recommendation
To: None <gary@wheel.tiac.net, jonathan@dsg.stanford.edu>
From: Ross Harvey <ross@teraflop.com>
List: port-i386
Date: 08/07/1997 12:34:25
>>> The P6 does unnecessary read-for-ownership cycles even when not in SMP mode.
>>> It ruins the main memory write performance. However, the reads and in-cache
>>> writes are so much faster that this only affects things like long bcopys.
>>
>>>Shouldn't bcopy and friends use a read after every cache-line full of data
>>>written (considering alignment and all that fun)? IIRC that would cause a
>>>burst-read of the cache-line, a fill in the cache in exclusive mode and then
>>>a burst-write of the whole line back to main memory. Should be faster since
>>>P5...
>

 [Jonathan Stone]

>Uh... I collaborated on Kevin Lai's USENIX paper which points out the
>advantages of a read-cache-line-before-write strategy on Pentium chips
>with a no-allocate-on-write-miss cache policy.
>
>I hacked up a bcopy() based on Kevin's and then ported the FreeBSD
>bcopy(), which does marginally (or even, if you prefer) better by
>using the FPU registers to issue 64-bit copies. 
>
>I've sent copies of that to Frank van der Linden (Charles Hannum, the
>principal i386 portmaster, discards all e-mail from me, so I haven't
>sent him a copy).
>
>Caveat emptor.

You certainly achieved a giant P5 speedup on bcopies up to 10K;
however, the effect dies out abruptly right there. The P6 issue is with
pure dram writes, whereas your speedup affects only in-cache, non-dram
bcopies. (Although there is a spillover effect if you leave some part
of the bcopy in the cache and never measure the time it takes to flush
it out to dram.)

Anyway, I like your bcopy a lot: it gets a huge speedup on copies up
to about 10K (and on MMX, even better), and a small speedup for some
additional length. It is not needed and won't help on the P6, which
already gets great in-cache numbers, but it should do little or no
harm, either.
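
For anyone who has not seen the read-before-write trick, here is a
minimal sketch of the idea in C. It is not Jonathan's bcopy or the
FreeBSD one; the name touch_bcopy is made up for illustration, the
32-byte line size is the P5 figure, and unlike a real bcopy it assumes
the two regions do not overlap. It reads one byte of each destination
line before writing it, so a no-allocate-on-write-miss cache pulls the
line in first and the stores then hit in the cache.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 32-byte cache lines, as on the P5. */
#define CACHE_LINE 32

/*
 * Hypothetical illustration only.  Copies len bytes from src to dst
 * (bcopy argument order); the regions must not overlap.
 */
void
touch_bcopy(const void *src, void *dst, size_t len)
{
	const unsigned char *s = src;
	unsigned char *d = dst;
	size_t head;

	/* Copy leading bytes until dst sits on a cache-line boundary. */
	head = (CACHE_LINE - (uintptr_t)d % CACHE_LINE) % CACHE_LINE;
	if (head > len)
		head = len;
	memcpy(d, s, head);
	d += head;
	s += head;
	len -= head;

	while (len >= CACHE_LINE) {
		/* Read one byte so the line is brought into the cache first. */
		(void)*(volatile unsigned char *)d;
		/* These stores should now hit in the cache. */
		memcpy(d, s, CACHE_LINE);
		d += CACHE_LINE;
		s += CACHE_LINE;
		len -= CACHE_LINE;
	}

	/* Copy whatever is left over after the last full line. */
	memcpy(d, s, len);
}
```

A call looks like touch_bcopy(from, to, nbytes). Whether the extra
read pays off depends on the cache policy of the particular chip, so
treat this as something to measure, not a guaranteed win.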

This should definitely be in NetBSD for all those P5 users.

For those of you who are interested, here are some very rough numbers:
(The variation in the numbers is mostly due to the difference between
L1 and L2 cache speeds, so you can roughly read each range as L2-L1.)

			CACHE AND DRAM SPEEDS

		P5 cache	P6 cache	P5 dram		P6 dram

	MB/S	75-200		400-775		50-75		 70 (W)		
								200 (R)


			JONATHAN STONE'S FAST P5 BCOPY
	
				1K	5K	50K	500K
		
	Old bcopy MB/S		 35	 35	35	30

	JS bcopy  MB/S		250	260	50	30
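
MB/s figures of this kind can be gathered with a loop of roughly the
following shape. This is only a sketch, not the harness behind the
table above: the names mb_per_sec and measure_copy are made up, memcpy
stands in for whichever copy routine is being timed, and 1 MB is
assumed to mean 10^6 bytes. Note that repeating a small copy keeps the
buffers in cache, which is exactly the in-cache versus dram
distinction discussed above.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Assumption: 1 MB = 10^6 bytes.  Returns 0.0 if no time elapsed. */
double
mb_per_sec(double bytes, double seconds)
{
	return seconds > 0.0 ? bytes / 1e6 / seconds : 0.0;
}

/*
 * Hypothetical helper: time reps copies of size bytes and return the
 * rate in MB/s, or 0.0 if allocation fails or size is zero.
 */
double
measure_copy(size_t size, unsigned reps)
{
	unsigned char *src = malloc(size);
	unsigned char *dst = malloc(size);
	volatile unsigned char sink = 0;
	clock_t t0, t1;
	unsigned i;
	double rate = 0.0;

	if (src != NULL && dst != NULL && size > 0) {
		memset(src, 0xa5, size);
		memset(dst, 0, size);

		t0 = clock();
		for (i = 0; i < reps; i++) {
			memcpy(dst, src, size);
			/* Use the result so the copy is not optimized away. */
			sink = dst[size - 1];
		}
		t1 = clock();
		(void)sink;

		rate = mb_per_sec((double)size * reps,
		    (double)(t1 - t0) / CLOCKS_PER_SEC);
	}
	free(src);
	free(dst);
	return rate;
}
```

Calling measure_copy(1024, reps), measure_copy(5 * 1024, reps) and so
on, with reps large enough to run for a measurable time, covers the
sizes in the table.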
----------------------
Ross Harvey	Avalon Computer Systems, Inc.		  ross@teraflop.com
		Santa Barbara	 		    http://www.teraflop.com