Subject: Re: copyin/out
To: David Laight <david@l8s.co.uk>
From: Allen Briggs <briggs@wasabisystems.com>
List: port-arm
Date: 08/09/2002 16:54:54

On Fri, Aug 09, 2002 at 11:18:13AM +0100, David Laight wrote:
> You need to do a system-wide benchmark for that.  It all depends
> on what you displace in order to include your unrolled loops.

I'm actually interested in what people observe on different ARM
architectures, hence my post to port-arm.  If this code isn't
better for some, then we need to do something different.  If it
is better, then I think we want to use it.

> > Basically, I've ditched the pte scan and I'm using ldr[b]t and str[b]t
> > to access user data.  I've also unrolled some loops and I've put in
> > some code to prefetch with the 'pld' instruction on XScale.
> 
> I got a significant benefit on SA1100 by doing a read ahead of the target
> address (to pull it into the data cache).  I can't remember the speeds
> I got (and no longer have a test system).  But this is the byte loop:
> 	ldrb    r4, [r0], #1
> 11:	subs    r2, r2, #1
> 	ldrneb  r5, [r1,#24]
> 	strb    r4, [r1], #1
> 	ldrneb  r4, [r0], #1
> 	bne     11b
> Only with the 'prefetch' did the order of the instructions matter.

That makes sense since your stalls were otherwise lost in the noise of
a cache miss-write-through cycle.  The 'prefetch' pulled the target
into the cache so you didn't have a cache miss and so the store went
to the cache instead of to memory.
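
For anyone following along in C, the effect can be sketched roughly as
below.  This is only an illustration under stated assumptions, not the
code under discussion: copy_with_dst_prefetch and AHEAD are made-up
names, the distance of 24 simply mirrors the #24 offset in the quoted
loop, and __builtin_prefetch (a GCC/Clang builtin) stands in for the
read-ahead.  The quoted loop uses a real load (ldrneb) rather than a
hint, so the two are not identical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical read-ahead distance in bytes; mirrors the #24 offset in
 * the quoted loop and is not a tuned value. */
#define AHEAD 24

/* Byte copy that hints the cache about the destination slightly ahead
 * of each store, so the store is more likely to hit in the cache. */
void copy_with_dst_prefetch(unsigned char *dst, const unsigned char *src,
                            size_t len)
{
    for (size_t i = 0; i < len; i++) {
        /* Compute the hint address as an integer so we never form an
         * out-of-range pointer by arithmetic.  Second argument 1 means
         * "prefetch for write". */
        __builtin_prefetch((const void *)((uintptr_t)dst + i + AHEAD), 1);
        dst[i] = src[i];
    }
}
```

Whether a hint like this helps at all will depend on the core and the
cache mode, which is what the benchmarking is meant to find out.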

> > With this, I'm seeing copyout run at about 63MB/s on a simple test
> > (dd if=/dev/zero of=/dev/null count=1024 bs=1024k).
> How fast did it run before?

Depends on the cache mode.  With the standard write-back cache mode
(like on the SA-110), it was running closer to 40MB/s.  With the
write-allocate cache line allocation policy, the old code was only
slightly better than that.
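
To put those figures in terms of the dd run above, here is a small
back-of-the-envelope helper.  It is a sketch only: seconds_for is a
made-up name, and it assumes 1 MB means 2^20 bytes, which the numbers
quoted here do not actually state.

```c
/* Seconds needed to move `bytes` at `mb_per_s`, taking 1 MB = 2^20
 * bytes (an assumption, not something stated in this thread). */
double seconds_for(double bytes, double mb_per_s)
{
    return bytes / (mb_per_s * 1048576.0);
}
```

With count=1024 and bs=1024k the test moves 1024 * 2^20 bytes, so under
that assumption 63MB/s is a little over 16 seconds and 40MB/s is about
25.6 seconds.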

> Does this help?  Are there enough 0 length transfers for it to matter?

As Jason said, we're going to be profiling this.

> > 	 * Align destination to word boundary.
> > 	and	r6, r1, #0x3
> > 	ldr	pc, [pc, r6, lsl #2]
> > 	b	Lialend
> > 	.word	Lialend
> > 	.word	Lial1
> > 	.word	Lial2
> > 	.word	Lial3
> > Lial3:	ldrbt	r6, [r0], #1
> > 	sub	r2, r2, #1
> > 	strb	r6, [r1], #1
> > Lial2:	ldrbt	r7, [r0], #1
> > 	sub	r2, r2, #1
> > 	strb	r7, [r1], #1
> > Lial1:	ldrbt	r6, [r0], #1
> > 	sub	r2, r2, #1
> > 	strb	r6, [r1], #1
> > Lialend:
> 
> How about:
> 	ands	r6, r1, #3
> 	addne	pc, pc, r6, lsl #3
> 	b	Lialend
> 	nop
> 	nop
> 	ldrbt	r7, [r0], #1
> 	strb	r7, [r1], #1
> 	ldrbt	r7, [r0], #1
> 	strb	r7, [r1], #1
> 	ldrbt	r7, [r0], #1
> 	strb	r7, [r1], #1
> 	eor	r6, r6, #2
> 	sub	r2, r2, r6
> Lialend:

I'm not sure I've convinced myself that that's the same thing.  Also,
you'll have data dep stalls just using r7 there.  Have you tested this?
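
One way to check the equivalence by hand is to model the arithmetic in
C.  This is a sketch with made-up function names, and it assumes the
usual ARM-state rule that reading pc yields the instruction's address
plus 8, so that "addne pc, pc, r6, lsl #3" lands 8 * r6 bytes past the
first nop.

```c
#include <stdint.h>

/* Bytes that must be copied singly before dst is word aligned. */
unsigned bytes_to_align(uintptr_t dst)
{
    return (4u - (unsigned)(dst & 3u)) & 3u;
}

/* Bytes the suggested sequence copies: for r6 != 0 the jump skips
 * r6 - 1 of the three ldrbt/strb pairs, so 4 - r6 pairs execute. */
unsigned suggested_bytes_copied(unsigned r6)
{
    return r6 ? 4u - r6 : 0u;
}

/* What the suggested sequence subtracts from the length, i.e. the
 * result of "eor r6, r6, #2".  Only reached when r6 != 0. */
unsigned suggested_len_adjust(unsigned r6)
{
    return r6 ^ 2u;
}
```

Under those assumptions the two agree for r6 = 1 (3 bytes) and r6 = 3
(1 byte), but for r6 = 2 the sequence copies 2 bytes while the eor
yields 0, so the length would be off by two.  Something that computes
4 - r6 (for instance "rsb r6, r6, #4") looks like what is wanted, but
that is untested.  The same helper is handy for checking any dispatch
of this shape: an offset of 1 needs three byte copies, an offset of 3
needs one.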

> > 	/* If few bytes left, finish slow. */
> > 	cmp	r2, #0x08
> > 	blt	Licleanup
> 
> Surely it is worth increasing the number of bytes we enter this
> path with to ensure that the branch is never taken.

Possibly.  Although I rather suspect that we often call this code
with 4 or 8 bytes (granted, probably aligned).  This gets back to the
histograms...  :-)

> > 	/* If source is not aligned, finish slow. */
> > 	ands	r3, r0, #0x03
> > 	bne	Licleanup
> Maybe worth checking src and dest have the same alignment earlier.

I did this at first, but this increased the number of branches for
some paths in the code.
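
For reference, the check itself is cheap; the cost is in the extra
branch.  A minimal C sketch of the test (same_word_alignment is a
made-up name, and the addresses are passed as integers purely for
illustration):

```c
#include <stdint.h>

/* Non-zero when src and dst have the same offset within a 4-byte word,
 * so aligning the destination also aligns the source. */
int same_word_alignment(uintptr_t src, uintptr_t dst)
{
    return ((src ^ dst) & 3u) == 0;
}
```

In assembly this would be roughly an eor of the two pointers followed
by a tst against #3, plus the conditional branch that was the concern
here.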

> Do you always want to align on the destination?

I'm not sure.

> For SA1100, if you can do 'stm' writes of 4 words then you don't
> need to worry about the destination being cached (unless you want
> the data soon).

That's interesting.  Can you elaborate some on this?  Or point me to
a specific location in the manual?

> Also aligning the source might be a win on
> because you could use ldm after an initial ldrt (saving cache).

Cache & code size.

> Also, like the byte align code, it ought to be possible to
> avoid the data read and the 'sub r2,r2,#2' in each case.

The sub is really cheap because you'd be stuck in a data stall there
anyway, I believe.

> I'd try to reduce the code size by only having the 'copy
> cacheline' present once.  Shouldn't be too hard!

This is the classic size/speed tradeoff, I believe.  Either
we have another branch or we may prefetch data we will never
need.

> A truly horrid loop!

Thanks, I think.  ;-)

-allen

-- 
 Allen Briggs                     briggs@wasabisystems.com
 http://www.wasabisystems.com/    Quality NetBSD CDs, Sales, Support, Service
NetBSD development for Alpha, ARM, M68K, MIPS, PowerPC, SuperH, XScale, etc...