Subject: Re: copy performance
To: David Laight <david@l8s.co.uk>
From: Richard Earnshaw <rearnsha@arm.com>
List: port-arm
Date: 03/21/2002 10:35:39

david@l8s.co.uk said:
>  However adding an extra memory read:
> 10:	ldrb    r4, [r1]
> 	ldrb    r4, [r0], #1
> 	strb    r4, [r1], #1
> 	subs    r2, r2, #1
> 	bne     10b
> puts the destination into the data cache and speeds it up
> to 470 


But that will slow things down on a machine with write-through caches, 
since we will now fetch a line into the cache (taking many cycles) that 
will never be used.


I'm not sure that there is a generic solution here.  Even the ARMv5 PLD 
instruction wouldn't help much.


david@l8s.co.uk said:
> For aligned copies using ldmia/stmia loops forcing a read doesn't help
> large copies.  However short copies (ie ones where the source and
> destination stay in the cache) speed up by a factor of 4 if the
> destination is in the data cache. (the source was always cached during
> this test.) 

What happens if you 'prefetch' from address + 15?  Don't forget that 
unless everything you store is cache-line aligned, the prefetch for the 
store will fetch only part of the area you are writing to.  If you 
'prefetch' from a higher address, then over a long copy you will ensure 
that most write data goes into the cache.
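
To make the alignment point concrete, here is a rough model in C rather 
than a measurement.  It assumes 32-byte cache lines, 16 bytes stored per 
loop iteration, a cache that allocates lines only on reads, and no 
evictions; the function name and the constants are made up for the 
illustration.  It counts how many stored bytes land in a line that a 
dummy read has already pulled in.

```c
#include <assert.h>
#include <stddef.h>

#define LINE     32u    /* assumed cache line size in bytes */
#define CHUNK    16u    /* assumed bytes stored per loop iteration */
#define MAXLINES 1024u  /* size of the toy cache model */

/*
 * Hypothetical helper: simulate a copy of len bytes to address dst,
 * issuing one dummy read at (chunk start + read_off) before each
 * 16-byte store.  Returns the number of stored bytes that fall in a
 * line already fetched by a dummy read.
 */
static size_t bytes_hitting_cache(size_t dst, size_t len, size_t read_off)
{
    unsigned char fetched[MAXLINES] = {0};
    size_t hits = 0;

    /* Keep every line index inside the toy model. */
    assert((dst + len + read_off) / LINE < MAXLINES);

    for (size_t pos = dst; pos < dst + len; pos += CHUNK) {
        /* The dummy read allocates the line containing pos + read_off. */
        fetched[(pos + read_off) / LINE] = 1;

        /* Stores hit only if their line was fetched by some read. */
        for (size_t b = pos; b < pos + CHUNK; b++)
            if (fetched[b / LINE])
                hits++;
    }
    return hits;
}
```

Under those assumptions, a 256-byte copy to a destination at offset 20 
gives 224 bytes hitting the cache when reading at offset 0 and 244 when 
reading at offset 15; with a line-aligned destination both give 256.  
With a 16-byte chunk, offset 15 is the last byte of the chunk, so the 
dummy read never goes past the end of the destination.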

R.