port-powerpc: Re: about powerpc version of in{,4}

Subject: Re: about powerpc version of in{,4}_cksum
To: Simon Burge <simonb@wasabisystems.com>
From: enami tsugutomo <enami@but-b.or.jp>
List: port-powerpc
Date: 07/31/2002 00:05:15
> > 1) addze 7,7 or addze %1,%1 are used to clear carry bit, but they
> >    aren't correct.  The former fails to clear if junk register r7
> >    happens to contain 0xffffffff, and the latter may clobber a
> >    non-junk (i.e., necessary) register.
> 
> I'm not sure that "addic 0,0,0" is going to do the right thing either
> for clearing the carry.  For roughly this code fragment
> 
> >                       n = mlen >> 6;
> > !                     __asm __volatile(
> > !                         "addic 0,0,0;"              /* clear carry */
> > !                         "mtctr %1;"                 /* load loop count */
> > !                         "1:"
> > !                         "lwz 7,4(%2);"              /* load current data
> 
> here's the "objdump -d"
> 
> 	 12c:   7f e0 36 70     srawi   r0,r31,6
> 	 130:   39 66 ff fc     addi    r11,r6,-4
> 	 134:   30 00 00 00     addic   r0,r0,0
> 	 138:   7c 09 03 a6     mtctr   r0
> 	 13c:   80 eb 00 04     lwz     r7,4(r11)
> 
> gcc has put "n" in r0, so the "addic 0,0,0" will add the carry bit to
> "n".  Allen suggested the following:
> 
> 	addi 7,0,0; addic 7,7,0;
> 
> "make sure r7 is zero, then add 0 to it w/ carry".  We choose r7 since
> that is one of the regs we mark as clobbered.

`addic rD, rA, SIMM' is defined as follows (at least according to my
manual):

	rD <- (rA) + EXTS(SIMM)

So no carry is added (although the carry bit is affected), and if SIMM
is zero, the carry bit is always cleared.  So addic 0,0,0 should be a
nop except that it clears the carry bit.  Am I missing something?
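
To make the two behaviours concrete, here is a minimal sketch in
portable C that models them.  It is not PowerPC code: the struct and
the function names model_addic and model_addze are hypothetical, and
the model assumes 32-bit registers with a single carry bit (CA).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of an instruction result: the value written to
 * rD and the state of the carry bit (CA) afterwards. */
struct ppc_result {
	uint32_t value;
	int ca;
};

/* addic rD,rA,SIMM:  rD <- (rA) + EXTS(SIMM);  CA <- carry out.
 * The incoming CA is not an input to this instruction. */
static struct ppc_result
model_addic(uint32_t ra, int16_t simm)
{
	/* Sign-extend SIMM to 32 bits, then add in 64 bits to see the carry. */
	uint64_t wide = (uint64_t)ra + (uint64_t)(uint32_t)(int32_t)simm;
	struct ppc_result r = { (uint32_t)wide, (int)(wide >> 32) };

	return r;
}

/* addze rD,rA:  rD <- (rA) + CA;  CA <- carry out. */
static struct ppc_result
model_addze(uint32_t ra, int ca_in)
{
	uint64_t wide;
	struct ppc_result r;

	assert(ca_in == 0 || ca_in == 1);
	wide = (uint64_t)ra + (uint64_t)ca_in;
	r.value = (uint32_t)wide;
	r.ca = (int)(wide >> 32);
	return r;
}
```

In this model, model_addic(x, 0) returns x unchanged with CA clear for
every x, while model_addze(0xffffffff, 1) wraps to zero and leaves CA
set, which is the failure case described for addze 7,7.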

> > 2) When adjusting to 4 byte boundary, just adding 16bit value to the
> >    variable `sum' isn't enough, since the `sum' may have full 32bit
> >    value there, depending on how a packet is divided into mbufs.  So,
> >    we need to care carry bit.  This actually prevented my Mac
> >    (g4-500dp) from netbooting.  (we can REDUCE instead but it results
> >    longer instructions).
> 
> I wonder if just:
> 
> 		if ((3 & (long) w) && (mlen > 0)) {
> 			REDUCE1;
> 			if ((1 & (long) w)) {
> 				sum <<= 8;
> 				s_util.c[0] = *w++;
> 				mlen--;
> 				byte_swapped = 1;
> 			}
> 			if ((2 & (long) w) && (mlen > 1)) {
> 				sum += *(uint16_t *)w;
> 				w += 2;
> 				mlen -= 2;
> 			}
> 		}
> 
> wouldn't be better?  It also occurs to me that only the last REDUCE
> has to be a REDUCE; the others can be a REDUCE1 - we don't care in the
> intermediate code whether we've reduced to a 16-bit or 17-bit value.

The difference between my asm() and sum += *(...) is just addc;addze
vs. add.  I guess REDUCE1 costs more instructions than that
difference, and I'm not sure the odd-byte case is any more common than
the 2-byte unaligned case.
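
The effect of the lost carry can be sketched in portable C.  This is
only a model under stated assumptions: add_plain stands in for the
plain add, add_end_around for the addc;addze pair, and fold16 for a
full reduction to 16 bits.  All three names are hypothetical and none
of this is the actual in_cksum code.

```c
#include <assert.h>
#include <stdint.h>

/* Plain `sum += v': a carry out of bit 31 is silently dropped. */
static uint32_t
add_plain(uint32_t sum, uint16_t v)
{
	return sum + v;
}

/* Model of addc;addze: add, then add the carry out back in. */
static uint32_t
add_end_around(uint32_t sum, uint16_t v)
{
	uint64_t wide = (uint64_t)sum + v;

	/* Cannot overflow again: if a carry occurred, the low word is small. */
	return (uint32_t)wide + (uint32_t)(wide >> 32);
}

/* Fold a 32-bit one's complement partial sum down to 16 bits. */
static uint16_t
fold16(uint32_t sum)
{
	sum = (sum >> 16) + (sum & 0xffff);
	sum = (sum >> 16) + (sum & 0xffff);
	return (uint16_t)sum;
}

/* Example: a nearly full 32-bit sum plus one 16-bit word. */
static void
check_end_around_example(void)
{
	assert(fold16(add_end_around(0xffffff00u, 0x0200)) == 0x0101);
	assert(fold16(add_plain(0xffffff00u, 0x0200)) == 0x0100);
}
```

With sum = 0xffffff00 and a 16-bit word of 0x0200, the plain add folds
to 0x0100 while the end-around version folds to 0x0101, so the two
differ by exactly the dropped carry.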

> > 3) In the asm statement, constraint letter "b" (base register)
> >    should be used instead of "r" for pointer operands.
> 
> Ok, I didn't see the ppc constraints.  Does this make any real-world
> differences?

Please read rs6000.h.  The difference is whether r0 can be used or
not.  And if you read my diff, you will notice that forcing r0 for
the variable `n' is no longer necessary.
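
As an illustration of why r0 matters, here is a minimal sketch of a
load that uses the "b" constraint.  In the base-register position of
lwz, a register field of 0 means the literal value zero rather than
the contents of r0, so a pointer operand must not be allocated to r0.
The helper name load_word_at_4 is hypothetical, and the non-PowerPC
branch is only a stand-in so the fragment also builds elsewhere.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load the 32-bit word at byte offset 4 from p. */
static uint32_t
load_word_at_4(const void *p)
{
	uint32_t v;

#if defined(__powerpc__) || defined(__PPC__)
	/*
	 * "b" keeps p out of r0.  With "r" the compiler would be free
	 * to pick r0, and `lwz rD,4(0)' reads address 4, not 4 + (r0).
	 */
	__asm__ volatile("lwz %0,4(%1)" : "=r"(v) : "b"(p) : "memory");
#else
	/* Portable stand-in (assumption): same load done with memcpy. */
	memcpy(&v, (const unsigned char *)p + 4, sizeof v);
#endif
	return v;
}
```

Because the pointer can no longer land in r0, r0 stays available for
other operands such as a loop count.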

> That command line is 2000000 iterations over ten different random 1532
> byte mbufs, with a random alignment (the first column is "alignment %
> 4") for each mbuf.  So for the 405GP, there's ~no difference in speed.
> Given that there is a marked difference with your Mac, it would seem
> that switching to non-update loads seems like a sound decision.  It
> would be interesting to get benchmarks for other CPU models.

Could you please send me your test case?  As I wrote in my mail, I
tested only a single pattern.

enami.