Subject: Re: copyin/out
To: <>
From: David Laight <>
List: port-arm
Date: 08/09/2002 11:18:13
> My three main concerns are:
> 	1) how does it work on other ARM architectures
> 	2) is the code too large for the more limited
> 	   of the arm32 archs?

You need to do a system wide benchmark for that.  It all depends
on what you displace in order to include your unrolled loops.

I'm also not actually sure (and it is difficult to guess) whether
the code is likely to be in the cache when you start.  If not,
you need to allow for the memory fetch times of the instructions.
This can mean that code loops are faster than table lookups.

> 	3) Are there large, unaligned data copies going
> 	   through the copyin/copyout path?
> Basically, I've ditched the pte scan and I'm using ldr[b]t and str[b]t
> to access user data.  I've also unrolled some loops and I've put in
> some code to prefetch with the 'pld' instruction on XScale 

I got a significant benefit on SA1100 by doing a read ahead of the target
address (to pull it into the data cache).  I can't remember the speeds
I got (and no longer have a test system).  But this is the byte loop:
	ldrb    r4, [r0], #1		@ prime: first source byte
11:	subs    r2, r2, #1
	ldrneb  r5, [r1,#24]		@ read ahead: pull dest line into D-cache
	strb    r4, [r1], #1
	ldrneb  r4, [r0], #1		@ next source byte
	bne     11b
Only with the 'prefetch' did the order of the instructions matter.
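A rough C model of that software-pipelined loop (function name invented
here): the next source byte is loaded before the current one is stored,
and the destination is read 24 bytes ahead purely to pull its line into
the data cache, the value being thrown away just like r5 above.  As with
the ldrneb, the caller must ensure dst[24] past the end is addressable.

```c
#include <stddef.h>

static void
bytecopy_readahead(unsigned char *dst, const unsigned char *src, size_t n)
{
	volatile unsigned char sink;
	unsigned char cur;

	if (n == 0)
		return;
	cur = *src++;			/* ldrb r4: prime the pipeline */
	while (--n != 0) {
		sink = dst[24];		/* ldrneb r5: dest read-ahead */
		*dst++ = cur;		/* strb r4 */
		cur = *src++;		/* ldrneb r4: next byte */
	}
	*dst = cur;			/* final store */
	(void)sink;
}
```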

> With this, I'm seeing copyout run at about 63MB/s on a simple test
> (dd if=/dev/zero of=/dev/null count=1024 bs=1024k).

How fast did it run before?

Some heavily snipped comments..

> 	/* Quick exit if length is zero */	
> 	teq	r2, #0
> 	moveq	r0, #0
> 	moveq	pc, lr
Does this help?  Are there enough zero-length transfers for it to matter?

> 	 * Align destination to word boundary.
> 	and	r6, r1, #0x3
> 	ldr	pc, [pc, r6, lsl #2]
> 	b	Lialend
> 	.word	Lialend
> 	.word	Lial1
> 	.word	Lial2
> 	.word	Lial3
> Lial3:	ldrbt	r6, [r0], #1
> 	sub	r2, r2, #1
> 	strb	r6, [r1], #1
> Lial2:	ldrbt	r7, [r0], #1
> 	sub	r2, r2, #1
> 	strb	r7, [r1], #1
> Lial1:	ldrbt	r6, [r0], #1
> 	sub	r2, r2, #1
> 	strb	r6, [r1], #1
> Lialend:

How about (something like):
	ands	r6, r1, #3
	beq	Lialend
	rsb	r6, r6, #4		@ r6 = bytes needed to align dest (1-3)
	sub	r2, r2, r6		@ single length adjustment
	rsb	r3, r6, #3
	add	pc, pc, r3, lsl #3	@ skip (3 - r6) ldrbt/strb pairs
	nop				@ never executed: pc reads 8 ahead
	ldrbt	r7, [r0], #1
	strb	r7, [r1], #1
	ldrbt	r7, [r0], #1
	strb	r7, [r1], #1
	ldrbt	r7, [r0], #1
	strb	r7, [r1], #1

Which, in particular, saves pulling a chunk of the code into
the data cache.
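In C terms this branch-into-straight-line-code trick is the classic
switch fall-through: copy just enough head bytes to bring the
destination to a word boundary, with one length adjustment at the end
instead of a sub per byte.  A hypothetical sketch; the function name
and interface are invented here.

```c
#include <stddef.h>
#include <stdint.h>

static size_t
align_dst_head(unsigned char **dstp, const unsigned char **srcp, size_t n)
{
	unsigned char *dst = *dstp;
	const unsigned char *src = *srcp;
	size_t head = (size_t)(-(uintptr_t)dst & 3);	/* 0..3 bytes needed */

	if (head > n)
		head = n;
	switch (head) {			/* fall through, like the computed jump */
	case 3: *dst++ = *src++;	/* FALLTHROUGH */
	case 2: *dst++ = *src++;	/* FALLTHROUGH */
	case 1: *dst++ = *src++;	/* FALLTHROUGH */
	case 0: break;
	}
	*dstp = dst;
	*srcp = src;
	return n - head;		/* one subtraction for the whole head */
}
```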

> 	/* If few bytes left, finish slow. */
> 	cmp	r2, #0x08
> 	blt	Licleanup

Surely it is worth increasing the number of bytes we enter this
path with, to ensure that the branch is never taken.

> 	/* If source is not aligned, finish slow. */
> 	ands	r3, r0, #0x03
> 	bne	Licleanup

Maybe worth checking earlier that src and dest have the same alignment.

> 	/*
> 	 * Align destination to cacheline boundary.
> 	 * If source and destination are nicely aligned, this can be a big
> 	 * win.  If not, it's still cheaper to copy in groups of 32 even if
> 	 * we don't get the nice cacheline alignment.
> 	 */

Do you always want to align on the destination?
For SA1100, if you can do 'stm' writes of 4 words then you don't
need to worry about the destination being cached (unless you want
the data soon).  Also, aligning the source might be a win, because
you could use ldm after an initial ldrt (saving cache).

Also, like the byte align code, it ought to be possible to
avoid the data read and the 'sub r2,r2,#1' in each case.

> 	 * This loop basically works out to:
> 	 * do {
> 	 * 	prefetch-next-cacheline(s)
> 	 *	bytes -= 0x20;
> 	 *	copy cacheline
> 	 * } while (bytes >= 0x40);
> 	 * bytes -= 0x20;
> 	 * copy cacheline

I'd try to reduce the code size by only having the 'copy
cacheline' present once.  Shouldn't be too hard!
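A sketch of that restructuring, with the 32-byte "copy cacheline" body
present exactly once: the prefetch simply becomes conditional on a
further full line remaining.  prefetch() here is a stand-in for the
XScale pld instruction; the function name is invented.

```c
#include <stddef.h>
#include <string.h>

static void
prefetch(const void *p)
{
	(void)p;			/* pld on XScale; a no-op in this model */
}

static size_t
copy_lines(unsigned char *dst, const unsigned char *src, size_t bytes)
{
	while (bytes >= 0x20) {
		if (bytes >= 0x40)		/* another full line follows */
			prefetch(src + 0x20);
		memcpy(dst, src, 0x20);		/* the single copy-cacheline body */
		src += 0x20;
		dst += 0x20;
		bytes -= 0x20;
	}
	return bytes;			/* tail bytes for the cleanup code */
}
```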

> Licleanup:
> 	and	r6, r2, #0x3
> 	ldr	pc, [pc, r6, lsl #2]
> 	b	Licend
> 	.word	Lic4
> 	.word	Lic1
> 	.word	Lic2
> 	.word	Lic3
> Lic4:	ldrbt	r6, [r0], #1
> 	sub	r2, r2, #1
> 	strb	r6, [r1], #1
> Lic3:	ldrbt	r7, [r0], #1
> 	sub	r2, r2, #1
> 	strb	r7, [r1], #1
> Lic2:	ldrbt	r6, [r0], #1
> 	sub	r2, r2, #1
> 	strb	r6, [r1], #1
> Lic1:	ldrbt	r7, [r0], #1
> 	subs	r2, r2, #1
> 	strb	r7, [r1], #1
> Licend:
> 	bne	Licleanup

A truly horrid loop!
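One hypothetical way to flatten it: a plain word loop followed by at
most three tail bytes, so nothing branches back up through the jump
table.  In this C sketch memcpy stands in for the ldrt/str pairs and is
unaligned-safe; the function name is invented.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void
cleanup_copy(unsigned char *dst, const unsigned char *src, size_t n)
{
	while (n >= 4) {		/* whole words */
		uint32_t w;
		memcpy(&w, src, 4);
		memcpy(dst, &w, 4);
		src += 4;
		dst += 4;
		n -= 4;
	}
	while (n--)			/* 0..3 remaining bytes */
		*dst++ = *src++;
}
```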


David Laight: