Subject: Re: copy performance
From: David Laight <firstname.lastname@example.org>
Date: 03/21/2002 12:08:59
On Thu, Mar 21, 2002 at 10:35:39AM +0000, Richard Earnshaw wrote:
> email@example.com said:
> > However adding an extra memory read:
> > 10:	ldrb	r4, [r1]
> > 	ldrb	r4, [r0], #1
> > 	strb	r4, [r1], #1
> > 	subs	r2, r2, #1
> > 	bne	10b
> > puts the destination into the data cache and speeds it up
> > to 470
> But will slow things down on a machine with write-through caches, since
> now we will fetch a line into the cache (taking many cycles) that will
> never be used.
I don't see a massive problem in having CPU-dependent code in the
kernel. It is a bigger problem for user space code.
Also the target data is quite likely to be needed shortly afterwards,
so having it in the cache may just have transferred the cost.
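The dummy-read trick above can be sketched in C (names are mine, not
from the thread): the extra load of the destination byte pulls its
cache line into the D-cache, so the following store hits. The volatile
sink stops the compiler deleting the otherwise-dead load; whether this
wins still depends on the cache being write-back, as Richard notes.

```c
#include <stddef.h>

/* Byte copy with a dummy read of the destination first.
 * On a write-back cache without write-allocate, the read brings
 * the destination line into the D-cache so the store can hit.
 * Illustrative sketch only - the real win/loss is CPU-dependent. */
static void copy_preload_dst(unsigned char *dst,
                             const unsigned char *src, size_t n)
{
    volatile unsigned char sink;
    while (n--) {
        sink = *dst;        /* extra read: fetch dst line into cache */
        (void)sink;
        *dst++ = *src++;    /* the actual copy */
    }
}
```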
> I'm not sure that there is a generic solution here. Even the ARMv5 PLD
> instruction wouldn't help much.
pld is probably slightly easier than the ldr - especially since it won't
fault if you prefetch beyond the end of the buffer.
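In C, GCC's `__builtin_prefetch` is the usual way to get a PLD on
ARMv5TE and later. Unlike the dummy ldrb, an out-of-range PLD is
architecturally a no-op rather than a fault, so no end-of-buffer guard
is needed. The 32-byte prefetch distance below is an illustrative
guess, not a tuned value.

```c
#include <stddef.h>

/* Byte copy that prefetches the destination ahead of the store.
 * __builtin_prefetch(addr, 1) hints a prefetch-for-write; on ARM
 * targets with PLD support GCC emits the instruction, elsewhere it
 * compiles to nothing.  Prefetching past the end of dst is safe. */
static void copy_pld(unsigned char *dst, const unsigned char *src,
                     size_t n)
{
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&dst[i + 32], 1);  /* distance is a guess */
        dst[i] = src[i];
    }
}
```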
> > For aligned copies using ldmia/stmia loops forcing a read doesn't help
> > large copies.
> What happens if you 'prefetch' from address + 15?
All prefetch attempts for the ldm/stm loop fail (although having the
whole buffer in the cache at the start is a win). I suspect this is
because the stm generates a 4-word burst, which competes for the same
bus cycles used to write out the dirty cache line.
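For reference, the aligned ldmia/stmia-style loop being discussed looks
roughly like this in C - moving four 32-bit words per iteration, which
the compiler can turn into ldm/stm pairs on ARM. Both pointers are
assumed word-aligned and the count is in words; as noted, adding a
prefetch inside this loop did not help in the measurements above.

```c
#include <stdint.h>
#include <stddef.h>

/* Aligned copy, four words per iteration (ldmia/stmia shape).
 * dst and src must be 4-byte aligned; nwords is a count of
 * 32-bit words, with a scalar tail for the remainder. */
static void copy_words4(uint32_t *dst, const uint32_t *src,
                        size_t nwords)
{
    while (nwords >= 4) {       /* 4-word burst per iteration */
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst += 4;
        src += 4;
        nwords -= 4;
    }
    while (nwords--)            /* tail: 0-3 leftover words */
        *dst++ = *src++;
}
```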
I can see that optimising this code requires a reasonable benchmark
and a system where a new copyin (etc.) can be loaded into a running
kernel. There are certainly significant gains to be made though!
David Laight: firstname.lastname@example.org