port-powerpc: Re: about powerpc version of in{,4}

Subject: Re: about powerpc version of in{,4}_cksum
To: enami tsugutomo <enami@sm.sony.co.jp>
From: Simon Burge <simonb@wasabisystems.com>
List: port-powerpc
Date: 07/30/2002 23:41:20
I've been talking with Bill Studenmund and Allen Briggs about the
powerpc in_cksum.c just today :-)  Bill has a testcase ("read a file
over NFS") where the checksum isn't calculated correctly...


On Tue, Jul 30, 2002 at 07:08:25PM +0900, enami tsugutomo wrote:

> There is some bug and some possible improvement for current powerpc
> version of in{,4}_cksum (which is found in
> sys/arch/powerpc/powerpc/in_cksum.c).  Highlight is:
> 
> 1) addze 7,7 or addze %1,%1 are used to clear carry bit, but they
>    aren't correct.  The former fails to clear if junk register r7
>    happen to contain 0xffffffff, and the latter may crobber non-junk
>    (i.e., necessary) register.

I'm not sure that "addic 0,0,0" is going to do the right thing either
for clearing the carry.  For roughly this code fragment

>                       n = mlen >> 6;
> !                     __asm __volatile(
> !                         "addic 0,0,0;"              /* clear carry */
> !                         "mtctr %1;"                 /* load loop count */
> !                         "1:"
> !                         "lwz 7,4(%2);"              /* load current data

here's the "objdump -d"

	 12c:   7f e0 36 70     srawi   r0,r31,6
	 130:   39 66 ff fc     addi    r11,r6,-4
	 134:   30 00 00 00     addic   r0,r0,0
	 138:   7c 09 03 a6     mtctr   r0
	 13c:   80 eb 00 04     lwz     r7,4(r11)

gcc has put "n" in r0, so the "addic 0,0,0" will add the carry bit to
"n".  Allen suggested the following:

	addi 7,0,0; addic 7,7,0;

"make sure r7 is zero, then add 0 to it w/ carry".  We choose r7 since
that is one of the regs we mark as clobbered.

> 2) When adjusting to 4 byte boundary, just adding 16bit value to the
>    variable `sum' isn't enough, since the `sum' may have full 32bit
>    value there, depending on how a packet is divided into mbufs.  So,
>    we need to care carry bit.  This actually prevented my Mac
>    (g4-500dp) from netbooting.  (we can REDUCE instead but it results
>    longer instructions).

I wonder if just:

		if ((3 & (long) w) && (mlen > 0)) {
			REDUCE1;
			if ((1 & (long) w)) {
				sum <<= 8;
				s_util.c[0] = *w++;
				mlen--;
				byte_swapped = 1;
			}
			if ((2 & (long) w) && (mlen > 1)) {
				sum += *(uint16_t *)w;
				w += 2;
				mlen -= 2;
			}
		}

wouldn't be better?  It also occurs to me that only the last REDUCE
has to be a REDUCE; the others can be a REDUCE1 - we don't care in the
intermediate code whether we've reduced to a 16-bit or 17-bit value.

> 3) In asm statemnt, constraint letter "b" (base register) should be
>    used instead of "r" for pointer operand.

Ok, I didn't see the ppc constraints.  Does this make any real-world
differences?

> 4) Load with update is slower than ordinary load (+ add) depending on
>    implementation (this is written in powerpc programming manual).
>    With appended changes below, in4_cksum for a paticular packet
>    (which is a nfs packet I've used to debug 2) above) runs about 15%
>    faster on my Mac.  I'm not sure how this affects other powerpc
>    variant though.

On the IBM 405GP, I get the following from a test harness using the
current code ("old") and your patches ("new"):

walnut:~/src/cksum.ppc 24> ./t 2000000 10 1532
align:2 old=3e8d new=3e8d oldt=9.523496 sec (952 clocks) newt=9.543340 sec (952 clocks)
align:2 old=9aa1 new=9aa1 oldt=9.523538 sec (952 clocks) newt=9.534748 sec (953 clocks)
align:1 old=b3e8 new=b3e8 oldt=9.691512 sec (969 clocks) newt=9.695219 sec (967 clocks)
align:0 old=9eb1 new=9eb1 oldt=9.432990 sec (941 clocks) newt=9.434793 sec (940 clocks)
align:1 old=c4ed new=c4ed oldt=9.691568 sec (968 clocks) newt=9.694795 sec (969 clocks)
align:0 old=09a2 new=09a2 oldt=9.433257 sec (943 clocks) newt=9.442578 sec (943 clocks)
align:3 old=587e new=587e oldt=9.713674 sec (971 clocks) newt=9.704928 sec (969 clocks)
align:1 old=21a4 new=21a4 oldt=9.691515 sec (967 clocks) newt=9.695213 sec (968 clocks)
align:0 old=7576 new=7576 oldt=9.433068 sec (942 clocks) newt=9.442846 sec (943 clocks)
align:0 old=6bdd new=6bdd oldt=9.433201 sec (943 clocks) newt=9.434634 sec (942 clocks)

That command line is 2000000 iterations over ten different random 1532
byte mbufs, with a random alignment (the first column is "alignment %
4") for each mbuf.  So for the 405GP, there's ~no difference in speed.
Given that there is a marked difference with your Mac, it would seem
that switching to non-update loads seems like a sound decision.  It
would be interesting to get benchmarks for other CPU models.

> Comments?

Thanks for looking at this!  It'll be interesting to see if it fixes the
problem Bill was looking at.

Simon.
--
Simon Burge                                   <simonb@wasabisystems.com>
NetBSD Development, Support and Service:   http://www.wasabisystems.com/