Subject: Re: Xscale optimisations
To: David Laight <david@l8s.co.uk>
From: Richard Earnshaw <rearnsha@arm.com>
List: port-arm
Date: 10/14/2003 15:24:48
> > StrongARM load/store multiple instructions are expanded in the pipeline 
> > into a sequence of equivalent load/store word operations (which is why 
> > they take a long time to *not* execute if the condition fails).  A 
> > sequence of stores that miss the cache will go direct to the write buffer. 
> >  Provided that write-coalescing is enabled, this will be used to form a 
> > burst on the memory bus.
> 
> Mmmm, IIRC we only ever saw bursts on the memory bus for cache line writes.
> (Although it wasn't me driving the analyser that day.)

Hmm, yes, I suspect I was mistaken on that.   The SA110 timing apps note 
does seem to confirm your observations.

> 
> I know I got faster memcpy (on sa1100) by fetching the target buffer
> into the data cache (an ldr offset by a magic number would do the trick;
> it didn't stall since the target data was never used!)
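The quoted trick can be sketched in portable C. On the SA-1100 the dummy load would be a real ARM load from the destination a fixed ("magic") offset ahead; here it is just a volatile read at each line boundary, and the 8-word line size is an assumption for illustration, not a measured parameter:

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_WORDS 8  /* assumed 8-word (32-byte) cache line */

/* Copy n words, touching each destination line with a dummy read
 * before storing to it, so that on a non-write-allocate cache the
 * subsequent stores hit the data cache instead of going out as
 * individual non-sequential bus writes. */
static void memcpy_prefetch_dst(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i % LINE_WORDS == 0) {
            /* dummy load: allocates dst's line in the cache; the
             * loaded value is never used, so there is no stall
             * waiting on it */
            volatile uint32_t dummy = dst[i];
            (void)dummy;
        }
        dst[i] = src[i];
    }
}
```

On a write-allocate cache the dummy load buys nothing; it pays off on the SA-1100 precisely because stores that miss bypass the cache and go to the write buffer.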

Which would be faster would probably depend on the relative non-sequential 
(N) and sequential (S) memory access times and the number of words to be 
written to a line, plus some compensation for the fact that other useful 
data will likely be cast out of the cache.  For an 8-word line, a line 
fill plus a line write-back each cost one non-sequential access followed 
by seven sequential ones, i.e. 2(N+7S) in total, against 8N for eight 
individual non-sequential stores.  It is believable that 2(N+7S) < 8N 
(ie 2.33 S < N) for many memory systems, and thus that fetching a line 
into cache would most likely be more efficient than writing to memory 
that was out of the cache.

Actually, the DNARD PAL comments suggest it's more complicated than that: 
AFAICT a cache line fill will take 14 clock ticks and a line write 12 
clocks.  8 individual stores could take as many as 56 clocks, so there 
would be a clear win to pre-fetching the line (potentially a factor 4 
performance improvement).

R.