Subject: 4xx copyin/copyout [Was: CVS commit: src/sys/arch/powerpc/ibm4xx]
To: Juergen Hannken-Illjes <hannken@eis.cs.tu-bs.de>
From: Simon Burge <simonb@NetBSD.org>
List: source-changes
Date: 11/28/2007 02:22:13
Juergen Hannken-Illjes wrote:

> On Thu, Nov 22, 2007 at 01:33:08PM +0000, Herb Peyerl wrote:
> > 
> > Module Name:	src
> > Committed By:	hpeyerl
> > Date:		Thu Nov 22 13:33:08 UTC 2007
> > 
> > Modified Files:
> > 	src/sys/arch/powerpc/ibm4xx: trap.c
> > 
> > Log Message:
> > Optimize copyin/copyout to transfer as many words as possible before doing
> > residual bytes. This improves small transfers. As a result, we can avoid
> > doing bigcopyin/bigcopyout until len>1024 instead of len>256.
> > 
> > Reviewed by: simonb.
> > 
> > (everybody run, Herb's in the kernel again).
> 
> This breaks with alignment trap on the 1st copyout for evbppc/explora:
> 
> 	kaddr=0x2fda88 udaddr=0xfffebff5 len=11
> 
> Trap is at this line:
> 
> 	"   stw %[tmp],0(%[udaddr]);"       /* Store user word */

The 405 core manual says that the only intructions that can issue
alignment traps are dcbz, dcread, lwarx and stwcx.  I found a manual
for the 403CGX and it says "All data operands referenced by the Storage
Reference instructions (loads/stores) must be aligned on a corresponding
operand-size boundary." so that explains why I did see any issues when I
tested Herb's patches on a 405 Walnut.

I've just tried using lswi (Load String Word Immediate) in place if
lwz (and similar for the stores) to fix this without any significant
rewriting, and on a Walnut it gets the same performance as just using
lwz/stw.  Does the trailing patch work for you on the Explora?

If this works, we can possibly also look at unrolling the word loop a
bit since the load/store string instructions can do up to 32-bytes per
instruction.  Does anyone know how to request a number of consecutive
registers with gcc asm constraints?  If not, it might be easier to break
these two assembly fragments out to their own .S file...

If this patch doesn't fix the Explora, we can just add a alignment check
(if on a 403) and skip the word-at-a-time loop, although the trailing
loop will need to be updated to not just use "len % 4" bytes.

Cheers,
Simon.


Index: trap.c
===================================================================
RCS file: /cvsroot/src/sys/arch/powerpc/ibm4xx/trap.c,v
retrieving revision 1.46
diff -d -p -u -r1.46 trap.c
--- trap.c	22 Nov 2007 13:33:08 -0000	1.46
+++ trap.c	27 Nov 2007 15:11:07 -0000
@@ -438,11 +438,11 @@ copyin(const void *udaddr, void *kaddr, 
 		"   beq-  2f;"              /* No words. Go do bytes */
 		"   mtctr %[count];"
 		"1: mtpid %[ctx]; sync;"
-		"   lwz %[tmp],0(%[udaddr]);"       /* Load user word */
+		"   lswi %[tmp],%[udaddr],4;"       /* Load user word */
 		"   addi %[udaddr],%[udaddr],0x4;"  /* next udaddr word */
 		"   sync; isync;"
 		"   mtpid %[pid];sync;"
-		"   stw %[tmp],0(%[kaddr]);"        /* Store kernel word */
+		"   stswi %[tmp],%[kaddr],4;"        /* Store kernel word */
 		"   dcbf 0,%[kaddr];"           /* flush cache */
 		"   addi %[kaddr],%[kaddr],0x4;"    /* next udaddr word */
 		"   sync; isync;"
@@ -537,11 +537,11 @@ copyout(const void *kaddr, void *udaddr,
 		"   beq-  2f;"              /* No words. Go do bytes */
 		"   mtctr %[count];"
 		"1: mtpid %[pid];sync;"
-		"   lwz %[tmp],0(%[kaddr]);"        /* Load kernel word */
+		"   lswi %[tmp],%[kaddr],4;"        /* Load kernel word */
 		"   addi %[kaddr],%[kaddr],0x4;"    /* next kaddr word */
 		"   sync; isync;"
 		"   mtpid %[ctx]; sync;"
-		"   stw %[tmp],0(%[udaddr]);"       /* Store user word */
+		"   stswi %[tmp],%[udaddr],4;"       /* Store user word */
 		"   dcbf 0,%[udaddr];"          /* flush cache */
 		"   addi %[udaddr],%[udaddr],0x4;"  /* next udaddr word */
 		"   sync; isync;"