Subject: Re: about powerpc version of in{,4}_cksum
To: enami tsugutomo <enami@sm.sony.co.jp>
From: Simon Burge <simonb@wasabisystems.com>
List: port-powerpc
Date: 07/30/2002 23:41:20
I've been talking with Bill Studenmund and Allen Briggs about the
powerpc in_cksum.c just today :-) Bill has a testcase ("read a file
over NFS") where the checksum isn't calculated correctly...
On Tue, Jul 30, 2002 at 07:08:25PM +0900, enami tsugutomo wrote:
> There is some bug and some possible improvement for current powerpc
> version of in{,4}_cksum (which is found in
> sys/arch/powerpc/powerpc/in_cksum.c). Highlight is:
>
> 1) addze 7,7 or addze %1,%1 are used to clear carry bit, but they
> aren't correct. The former fails to clear if junk register r7
> happen to contain 0xffffffff, and the latter may crobber non-junk
> (i.e., necessary) register.
I'm not sure that "addic 0,0,0" is going to do the right thing either
for clearing the carry. For roughly this code fragment
> n = mlen >> 6;
> ! __asm __volatile(
> ! "addic 0,0,0;" /* clear carry */
> ! "mtctr %1;" /* load loop count */
> ! "1:"
> ! "lwz 7,4(%2);" /* load current data
here's the "objdump -d"
12c: 7f e0 36 70 srawi r0,r31,6
130: 39 66 ff fc addi r11,r6,-4
134: 30 00 00 00 addic r0,r0,0
138: 7c 09 03 a6 mtctr r0
13c: 80 eb 00 04 lwz r7,4(r11)
gcc has put "n" in r0, so the "addic 0,0,0" will add the carry bit to
"n". Allen suggested the following:
addi 7,0,0; addic 7,7,0;
"make sure r7 is zero, then add 0 to it w/ carry". We choose r7 since
that is one of the regs we mark as clobbered.
> 2) When adjusting to 4 byte boundary, just adding 16bit value to the
> variable `sum' isn't enough, since the `sum' may have full 32bit
> value there, depending on how a packet is divided into mbufs. So,
> we need to care carry bit. This actually prevented my Mac
> (g4-500dp) from netbooting. (we can REDUCE instead but it results
> longer instructions).
I wonder if just:
if ((3 & (long) w) && (mlen > 0)) {
REDUCE1;
if ((1 & (long) w)) {
sum <<= 8;
s_util.c[0] = *w++;
mlen--;
byte_swapped = 1;
}
if ((2 & (long) w) && (mlen > 1)) {
sum += *(uint16_t *)w;
w += 2;
mlen -= 2;
}
}
wouldn't be better? It also occurs to me that only the last REDUCE
has to be a REDUCE; the others can be a REDUCE1 - we don't care in the
intermediate code whether we've reduced to a 16-bit or 17-bit value.
> 3) In asm statemnt, constraint letter "b" (base register) should be
> used instead of "r" for pointer operand.
Ok, I didn't see the ppc constraints. Does this make any real-world
differences?
> 4) Load with update is slower than ordinary load (+ add) depending on
> implementation (this is written in powerpc programming manual).
> With appended changes below, in4_cksum for a paticular packet
> (which is a nfs packet I've used to debug 2) above) runs about 15%
> faster on my Mac. I'm not sure how this affects other powerpc
> variant though.
On the IBM 405GP, I get the following from a test harness using the
current code ("old") and your patches ("new"):
walnut:~/src/cksum.ppc 24> ./t 2000000 10 1532
align:2 old=3e8d new=3e8d oldt=9.523496 sec (952 clocks) newt=9.543340 sec (952 clocks)
align:2 old=9aa1 new=9aa1 oldt=9.523538 sec (952 clocks) newt=9.534748 sec (953 clocks)
align:1 old=b3e8 new=b3e8 oldt=9.691512 sec (969 clocks) newt=9.695219 sec (967 clocks)
align:0 old=9eb1 new=9eb1 oldt=9.432990 sec (941 clocks) newt=9.434793 sec (940 clocks)
align:1 old=c4ed new=c4ed oldt=9.691568 sec (968 clocks) newt=9.694795 sec (969 clocks)
align:0 old=09a2 new=09a2 oldt=9.433257 sec (943 clocks) newt=9.442578 sec (943 clocks)
align:3 old=587e new=587e oldt=9.713674 sec (971 clocks) newt=9.704928 sec (969 clocks)
align:1 old=21a4 new=21a4 oldt=9.691515 sec (967 clocks) newt=9.695213 sec (968 clocks)
align:0 old=7576 new=7576 oldt=9.433068 sec (942 clocks) newt=9.442846 sec (943 clocks)
align:0 old=6bdd new=6bdd oldt=9.433201 sec (943 clocks) newt=9.434634 sec (942 clocks)
That command line is 2000000 iterations over ten different random 1532
byte mbufs, with a random alignment (the first column is "alignment %
4") for each mbuf. So for the 405GP, there's ~no difference in speed.
Given that there is a marked difference with your Mac, it would seem
that switching to non-update loads seems like a sound decision. It
would be interesting to get benchmarks for other CPU models.
> Comments?
Thanks for looking at this! It'll be interesting to see if it fixes the
problem Bill was looking at.
Simon.
--
Simon Burge <simonb@wasabisystems.com>
NetBSD Development, Support and Service: http://www.wasabisystems.com/