Subject: Re: Xscale optimisations
To: None <>
From: David Laight <>
List: port-arm
Date: 10/14/2003 14:35:58
> StrongARM load/store multiple instructions are expanded in the pipeline 
> into a sequence of equivalent load/store word operations (which is why 
> they take a long time to *not* execute if the condition fails).  A 
> sequence of stores that miss the cache will go direct to the write buffer. 
>  Provided that write-coalescing is enabled, this will be used to form a 
> burst on the memory bus.

Mmmm IIRC we only ever saw bursts on the memory bus for cache line writes.
(Although it wasn't me driving the analyser that day.)

I know I got a faster memcpy (on sa1100) by fetching the target buffer
into the data cache (an lda offset by a magic number would do the trick;
it didn't stall since the target data was never used!)
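
A rough C sketch of that idea, not the code that was measured: before
each chunk is copied, a dummy read touches the destination a fixed
distance ahead so that the line is (hopefully) already in the data cache
when the stores arrive. The name copy_touch_dst, the 32-byte LINE and the
AHEAD distance are assumptions for illustration; the real "magic number"
would have to be found by measurement, and whether this helps at all
depends on the cache allocation policy of the core.

```c
#include <stddef.h>
#include <string.h>

#define LINE  32            /* assumed cache line size in bytes */
#define AHEAD (4 * LINE)    /* hypothetical "magic number" offset */

/* Copy n bytes, touching the destination AHEAD bytes in front of the
 * current position so the line is loaded before it is written. */
void *copy_touch_dst(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t off = 0;

    while (n - off >= LINE) {
        if (n - off > AHEAD) {
            /* Dummy load: the value is discarded, so nothing later
             * depends on it; volatile keeps the compiler from
             * deleting the read. Stays inside the buffer because
             * off + AHEAD < n here. */
            (void)*(volatile unsigned char *)(d + off + AHEAD);
        }
        memcpy(d + off, s + off, LINE);   /* copy one line's worth */
        off += LINE;
    }
    memcpy(d + off, s + off, n - off);    /* remaining bytes, if any */
    return dst;
}
```

In hand-written assembler the dummy read would just be one extra load
per line inside the copy loop, with its result register never used.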

I also wonder about writing the misaligned tail (esp. of memset)
before doing the bulk write.  That gave an improvement for the i386 kernel
memset (although the support for misaligned accesses makes it a lot easier
there).
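
A minimal C sketch of the tail-first ordering, assuming a target where a
misaligned word store is cheap (as on i386); it is not the kernel code
referred to above, and fill_tail_first is a made-up name. The last word
of the buffer is stored first, possibly misaligned, and the bulk loop
then only has to write whole words from the start. memcpy of a 4-byte
value is used as the portable way to spell a possibly misaligned store.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill n bytes at dst with the byte c, writing the tail first. */
void *fill_tail_first(void *dst, int c, size_t n)
{
    unsigned char *d = dst;
    /* Replicate the fill byte into all four bytes of a word. */
    uint32_t word = 0x01010101u * (unsigned char)c;

    if (n < sizeof word) {              /* too short for a word store */
        for (size_t i = 0; i < n; i++)
            d[i] = (unsigned char)c;
        return dst;
    }

    /* 1. Tail first: one word ending exactly at dst + n. */
    memcpy(d + n - sizeof word, &word, sizeof word);

    /* 2. Bulk: whole words from the start. Any bytes left over at
     *    the end were already covered by the tail store. */
    for (size_t off = 0; off + sizeof word <= n; off += sizeof word)
        memcpy(d + off, &word, sizeof word);

    return dst;
}
```

Without cheap misaligned stores the tail would have to be written a byte
at a time, which is why the trick is easier on i386 than on ARM.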


David Laight: