Subject: Re: copy performance
From: David Laight <firstname.lastname@example.org>
Date: 03/21/2002 12:08:59
On Thu, Mar 21, 2002 at 10:35:39AM +0000, Richard Earnshaw wrote:
> email@example.com said:
> > However adding an extra memory read:
> > 10:	ldrb	r4, [r1]
> > 	ldrb	r4, [r0], #1
> > 	strb	r4, [r1], #1
> > 	subs	r2, r2, #1
> > 	bne	10b
> > puts the destination into the data cache and speeds it up
> > to 470
> But will slow things down on a machine with write-through caches, since
> now we will fetch a line into the cache (taking many cycles) that will
> never be used.
I don't see a massive problem in having CPU-dependent code in the
kernel. It is a bigger problem for user space code.
Also the target data is quite likely to be needed shortly afterwards,
so having it in the cache may just have transferred the cost.
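The dummy-read trick above can be sketched in C (names are mine, not
from the thread): the extra load of the destination byte pulls its
cache line into the D-cache, so the following store hits. The volatile
sink stops the compiler deleting the otherwise-dead load; whether this
wins still depends on the cache being write-back, as Richard notes.

```c
#include <stddef.h>

/* Byte copy with a dummy read of the destination first.
 * On a write-back cache without write-allocate, the read brings
 * the destination line into the D-cache so the store can hit.
 * Illustrative sketch only - the real win/loss is CPU-dependent. */
static void copy_preload_dst(unsigned char *dst,
                             const unsigned char *src, size_t n)
{
    volatile unsigned char sink;
    while (n--) {
        sink = *dst;        /* extra read: fetch dst line into cache */
        (void)sink;
        *dst++ = *src++;    /* the actual copy */
    }
}
```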
> I'm not sure that there is a generic solution here. Even the ARMv5 PLD
> instruction wouldn't help much.
pld is probably slightly easier than the ldr - especially since it won't
fault if you prefetch beyond the end of the buffer.
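In C, GCC's `__builtin_prefetch` is the usual way to get a PLD on
ARMv5TE and later. Unlike the dummy ldrb, an out-of-range PLD is
architecturally a no-op rather than a fault, so no end-of-buffer guard
is needed. The 32-byte prefetch distance below is an illustrative
guess, not a tuned value.

```c
#include <stddef.h>

/* Byte copy that prefetches the destination ahead of the store.
 * __builtin_prefetch(addr, 1) hints a prefetch-for-write; on ARM
 * targets with PLD support GCC emits the instruction, elsewhere it
 * compiles to nothing.  Prefetching past the end of dst is safe. */
static void copy_pld(unsigned char *dst, const unsigned char *src,
                     size_t n)
{
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&dst[i + 32], 1);  /* distance is a guess */
        dst[i] = src[i];
    }
}
```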
> > For aligned copies using ldmia/stmia loops forcing a read doesn't help
> > large copies.
> What happens if you 'prefetch' from address + 15?
All prefetch attempts for the ldm/stm loop fail (although having the
whole buffer in the cache at the start is a win). I suspect this is
because the stm generates a 4-word burst, which competes for the same
bus cycles used to write out the dirty cache line.
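For reference, the aligned ldmia/stmia-style loop being discussed looks
roughly like this in C - moving four 32-bit words per iteration, which
the compiler can turn into ldm/stm pairs on ARM. Both pointers are
assumed word-aligned and the count is in words; as noted, adding a
prefetch inside this loop did not help in the measurements above.

```c
#include <stdint.h>
#include <stddef.h>

/* Aligned copy, four words per iteration (ldmia/stmia shape).
 * dst and src must be 4-byte aligned; nwords is a count of
 * 32-bit words, with a scalar tail for the remainder. */
static void copy_words4(uint32_t *dst, const uint32_t *src,
                        size_t nwords)
{
    while (nwords >= 4) {       /* 4-word burst per iteration */
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst += 4;
        src += 4;
        nwords -= 4;
    }
    while (nwords--)            /* tail: 0-3 leftover words */
        *dst++ = *src++;
}
```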
I can see that optimising this code requires a reasonable benchmark
and a system where a new copyin (etc.) can be loaded into a running
kernel. There are certainly significant gains to be made though!
David Laight: firstname.lastname@example.org