Subject: Re: Accelerating memset/memcpy
To: None <simonb@wasabisystems.com>
From: Paul Koning <pkoning@equallogic.com>
List: port-mips
Date: 10/01/2002 11:38:50
>>>>> "Simon" == Simon Burge <simonb@wasabisystems.com> writes:

 Simon> On Tue, Oct 01, 2002 at 10:01:39AM +0000, Nicolas BOUQUET
 Simon> wrote:
 >> Hi,
 >> 
 >> I recently benchmarked the hardware memory subsystem of our
 >> MIPS-based board (currently using PMC RM5231), and I was surprised
 >> to see that memory writes were slower than memory reads. My
 >> benchmark routine uses KSEG0 with write back for its tests.
 >> 
 >> So I took my books and found the reason quickly: in my case,
 >> memory writes in a particular cacheline are preceded by a cache
 >> refill if the line was previously unused. But in my case, these
 >> cache refills are not needed since I write entire cachelines (I
 >> transfer large blocks of data and measure the time it takes).
 >> 
 >> RM5231's datasheet states that this behaviour can be corrected by
 >> issuing a "create dirty exclusive" cache operation on the lines
 >> concerned. Doing so effectively increased write throughput: I can
 >> write to memory at 125MBytes/s instead of 50MBytes/s.
 >> 
 >> So here comes my question/reflexion: could these modifications be
 >> applied to the NetBSD kernel, for example through the
 >> memset/memcpy routines?

 Simon> Indeed, all (or just most?) MIPS32 and MIPS64 CPUs should be
 Simon> able to take advantage of this too with their PREF
 Simon> instructions, and there are probably a number of `older' MIPS
 Simon> IV-style CPUs that have a similar operation available.

PREF and CACHE(create dirty exclusive) are very different.  The CACHE
instruction would let you avoid the cacheline fill when you're writing
a full cacheline; PREF only lets you move that fill earlier in time.
If you're memory-bound, PREF may produce a small performance
improvement, but CACHE will give a significantly larger improvement.

Unfortunately, create dirty exclusive is platform-dependent.  Some
MIPS processors have it, some (including some very recent ones) do
not.
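
For what it's worth, here is a rough sketch of how a memset could use
the operation.  It assumes a 32-byte line and a hypothetical per-CPU
hook, claim_line_no_fill(), that would issue the create dirty exclusive
CACHE op where the processor has one.  Neither name comes from NetBSD,
and the hook is stubbed as a counter so the code runs anywhere.  The
important part is that only lines that will be completely overwritten
are claimed; a partially written line still needs its fill, or the
bytes you didn't write are lost.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed D-cache line size; the real value comes from the CPU datasheet. */
#define LINE_SIZE 32u

/* Bookkeeping for the stub below only; counts lines that skipped the fill. */
static unsigned long lines_claimed;

/*
 * Hypothetical hook.  On a CPU that has it, this would issue the
 * "create dirty exclusive" CACHE op on the line at `line` (kernel mode
 * only).  On a CPU without it, it would be a no-op.  Here it just counts.
 */
static void claim_line_no_fill(void *line)
{
    (void)line;
    lines_claimed++;
}

/* memset that claims every fully overwritten cache line before storing. */
static void *memset_full_lines(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    unsigned char *end = p + n;

    /* Leading partial line: ordinary stores, the fill is still needed. */
    size_t head = (size_t)(-(uintptr_t)p & (LINE_SIZE - 1));
    if (head > n)
        head = n;
    memset(p, c, head);
    p += head;

    /* Whole lines: safe to skip the fill, every byte gets overwritten. */
    while ((size_t)(end - p) >= LINE_SIZE) {
        claim_line_no_fill(p);
        memset(p, c, LINE_SIZE);
        p += LINE_SIZE;
    }

    /* Trailing partial line: ordinary stores again. */
    memset(p, c, (size_t)(end - p));
    return dst;
}
```

With a no-op hook this degenerates to an ordinary memset, which is
what you'd want on the processors that lack the operation.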

	   paul