Subject: Re: about powerpc version of in{,4}_cksum
To: Simon Burge <simonb@wasabisystems.com>
From: enami tsugutomo <enami@but-b.or.jp>
List: port-powerpc
Date: 07/31/2002 00:05:15
> > 1) addze 7,7 or addze %1,%1 are used to clear carry bit, but they
> > aren't correct. The former fails to clear if junk register r7
> > happens to contain 0xffffffff, and the latter may clobber non-junk
> > (i.e., necessary) register.
>
> I'm not sure that "addic 0,0,0" is going to do the right thing either
> for clearing the carry. For roughly this code fragment
>
> > n = mlen >> 6;
> > ! __asm __volatile(
> > ! "addic 0,0,0;" /* clear carry */
> > ! "mtctr %1;" /* load loop count */
> > ! "1:"
> > ! "lwz 7,4(%2);" /* load current data
>
> here's the "objdump -d"
>
> 12c: 7f e0 36 70 srawi r0,r31,6
> 130: 39 66 ff fc addi r11,r6,-4
> 134: 30 00 00 00 addic r0,r0,0
> 138: 7c 09 03 a6 mtctr r0
> 13c: 80 eb 00 04 lwz r7,4(r11)
>
> gcc has put "n" in r0, so the "addic 0,0,0" will add the carry bit to
> "n". Allen suggested the following:
>
> addi 7,0,0; addic 7,7,0;
>
> "make sure r7 is zero, then add 0 to it w/ carry". We choose r7 since
> that is one of the regs we mark as clobbered.
`addic rD, rA, SIMM' is defined as follows (according to my manual):

    rD <- (rA) + EXTS(SIMM)

So no carry is added (though the carry bit is affected), and if SIMM is
zero, the carry bit is always cleared, since (rA) + 0 can never carry
out. So addic 0,0,0 should be a nop except that it clears the carry
bit. Am I missing something?
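
To make that reading concrete, here is a small C model of how addic and
addze treat the carry bit. It is only a sketch of the definitions as I
understand them: the names ppc_result, addic_model and addze_model are
made up for illustration, and only rD and the CA bit are modelled.

```c
#include <stdint.h>

/* Result register and the XER[CA] carry bit after an instruction. */
struct ppc_result {
    uint32_t rd;
    unsigned ca;
};

/* addic rD,rA,SIMM: rD <- (rA) + EXTS(SIMM); CA <- carry out of the add.
 * The previous CA value is not an input. */
static struct ppc_result addic_model(uint32_t ra, int16_t simm)
{
    /* Sign-extend SIMM to 32 bits, then add in 64 bits to see the carry. */
    uint64_t wide = (uint64_t)ra + (uint32_t)(int32_t)simm;
    struct ppc_result r = { (uint32_t)wide, (unsigned)(wide >> 32) };
    return r;
}

/* addze rD,rA: rD <- (rA) + CA; CA <- carry out of the add. */
static struct ppc_result addze_model(uint32_t ra, unsigned ca_in)
{
    uint64_t wide = (uint64_t)ra + (ca_in & 1u);
    struct ppc_result r = { (uint32_t)wide, (unsigned)(wide >> 32) };
    return r;
}
```

In this model addic_model(x, 0) leaves x unchanged and CA clear for any
x, while addze_model(0xffffffff, 1) wraps to 0 and leaves CA set, which
is the r7 case from point 1.
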
> > 2) When adjusting to 4 byte boundary, just adding 16bit value to the
> > variable `sum' isn't enough, since the `sum' may have full 32bit
> > value there, depending on how a packet is divided into mbufs. So,
> > we need to care carry bit. This actually prevented my Mac
> > (g4-500dp) from netbooting. (we can REDUCE instead but it results
> > longer instructions).
>
> I wonder if just:
>
> if ((3 & (long) w) && (mlen > 0)) {
> REDUCE1;
> if ((1 & (long) w)) {
> sum <<= 8;
> s_util.c[0] = *w++;
> mlen--;
> byte_swapped = 1;
> }
> if ((2 & (long) w) && (mlen > 1)) {
> sum += *(uint16_t *)w;
> w += 2;
> mlen -= 2;
> }
> }
>
> wouldn't be better? It also occurs to me that only the last REDUCE
> has to be a REDUCE; the others can be a REDUCE1 - we don't care in the
> intermediate code whether we've reduced to a 16-bit or 17-bit value.
The difference between my asm() and sum += *(...) is just addc;addze
vs add. I guess REDUCE1 costs more instructions than that difference,
and I'm not sure the odd-byte case is as common as the 2-byte
unaligned case.
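
For illustration, the addc;addze pair amounts to an end-around carry.
The C below is a sketch, not the code from my diff: add_plain,
add_end_around and fold16 are hypothetical names, and fold16 stands in
for the final reduction to 16 bits.

```c
#include <stdint.h>

/* Plain add: a carry out of bit 31 is silently dropped. */
static uint32_t add_plain(uint32_t sum, uint16_t v)
{
    return sum + v;
}

/* addc;addze equivalent: add, then fold the carry back into the sum. */
static uint32_t add_end_around(uint32_t sum, uint16_t v)
{
    uint32_t r = sum + v;   /* addc: the carry out goes to CA */
    r += (r < sum);         /* addze: add CA back in */
    return r;
}

/* Fold a 32-bit one's complement sum down to 16 bits. */
static uint16_t fold16(uint32_t sum)
{
    sum = (sum >> 16) + (sum & 0xffffu);
    sum = (sum >> 16) + (sum & 0xffffu);
    return (uint16_t)sum;
}
```

With sum = 0xffffffff and a 16-bit value of 2, the plain add leaves 1
while the end-around version leaves 2, so the two fold to different
checksums once `sum' already holds a full 32-bit value.
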
> > 3) In asm statement, constraint letter "b" (base register) should be
> > used instead of "r" for pointer operand.
>
> Ok, I didn't see the ppc constraints. Does this make any real-world
> differences?
Please read rs6000.h. The difference is whether r0 can be used or
not. And if you read my diff, you will notice that forcing r0 for
the variable `n' is no longer necessary.
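
As a minimal sketch of why this matters: in `lwz rD,d(rA)' an rA field
of 0 means the literal value 0, not the contents of r0, so a pointer
used as a base must not be allocated to r0. The helper below is
hypothetical (load_word is not from the diff) and assumes a
GCC-compatible compiler; on non-PowerPC targets it falls back to plain
C so it still compiles.

```c
#include <stdint.h>

/* Load one 32-bit word through a pointer. */
static inline uint32_t load_word(const uint32_t *p)
{
#if defined(__powerpc__) || defined(__PPC__)
    uint32_t v;
    /* "b" keeps p out of r0; with "r" the compiler may pick r0, and
     * then 0(%1) would address absolute 0 instead of *p. */
    __asm__ ("lwz %0,0(%1)" : "=r" (v) : "b" (p) : "memory");
    return v;
#else
    return *p;  /* portable fallback, for illustration only */
#endif
}
```
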
> That command line is 2000000 iterations over ten different random 1532
> byte mbufs, with a random alignment (the first column is "alignment %
> 4") for each mbuf. So for the 405GP, there's ~no difference in speed.
> Given that there is a marked difference with your Mac, it would seem
> that switching to non-update loads seems like a sound decision. It
> would be interesting to get benchmarks for other CPU models.
Could you please send me your test case? As I wrote in my mail, I
tested only a single pattern.
enami.