port-powerpc: A few possible kernel assembly perf optimizations

Subject: A few possible kernel assembly perf optimizations
To: None <port-powerpc@netbsd.org>
From: Brian Grayson (home) <bgrayson@austin.rr.com>
List: port-powerpc
Date: 12/12/2002 00:18:16

[Since this is PowerPC specific, I'm sending it here instead of to
tech-perform.]

I don't know how performance-critical our assembly code is, but looking
through some of the assembly files in sys/arch/powerpc/powerpc, there
are a few things that strike me as not as efficient as they could be.

1. mtcr's that could really be mtcrf's. On the MPC7450 and newer
parts, a single-field mtcrf is not serialized at all and can execute
in any of the three IU1 units, whereas an mtcr (which is really an mtcrf
of all 8 fields) is execution serialized and must go to the lone IU2
unit -- see table A-4 and the
subsequent NOTE in Application Note AN2203/D, "MPC7450 Microprocessor
Family Software Optimization Guide", available from

http://e-www.motorola.com/webapp/sps/library/documentationlist.jsp?rootNodeId=03C1TR04670871&nodeId=03C1TR046708718653&Device=MPC7450&DocTypeKey=10KscRcb&Results=25

On longer pipelines (7450 etc.), this needless serialization hurts more
than on the short 603 and 604 pipelines.

As an example, in trapexit, there's an lwz to r5, followed by an
mtcr of r5, just to test bit 17. Bit 17 lives in CR field 4, so doing
an "mtcrf 0x08, r5" instead should execute faster on an MPC7450, and
no slower on any other parts.
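
For anyone wondering where the 0x08 comes from, here is a minimal C sketch of the arithmetic. It assumes the usual PowerPC numbering (CR bits 0..31 counted from the most significant end, four bits per field, and an 8-bit mtcrf mask whose leftmost bit selects CR0). The function names are made up for illustration and do not come from the kernel sources.

```c
#include <assert.h>

/*
 * CR bits are numbered 0..31 from the most significant end, four bits
 * per field, so CR bit N lives in field N / 4.
 * (Hypothetical helper, not a kernel function.)
 */
static unsigned cr_field_of_bit(unsigned cr_bit)
{
    assert(cr_bit < 32);
    return cr_bit / 4;
}

/*
 * The 8-bit mtcrf field mask also counts from the left:
 * 0x80 selects CR0 and 0x01 selects CR7.
 * (Hypothetical helper, not a kernel function.)
 */
static unsigned mtcrf_mask_for_bit(unsigned cr_bit)
{
    return 0x80u >> cr_field_of_bit(cr_bit);
}
```

With these, mtcrf_mask_for_bit(17) gives 0x08, which is the mask in the trapexit suggestion above.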

2. Use of xor to generate a 0 value instead of li. For example,
locore_subr.S is sprinkled with xor 3,3,3 or xor 31,31,31. This
creates a true dependency between the xor and whoever last updated the
register. In locore_subr.S, in cpu_switch, there is the following
code (I've removed the MULTIPROCESSOR code to make the flow easier to
follow):

	lis	3,_C_LABEL(curpcb)@ha
	lwz	31,_C_LABEL(curpcb)@l(3)
	xor	3,3,3			/* spl0() */
	bl	_C_LABEL(lcsplx)

The xor now depends on the lis instruction. If an li 3,0 is used
instead, the li can execute in parallel with the lis.

Also, from a readability point of view, I think most PowerPC assembly
writers find "li rX, 0" more readable than "xor rX,rX,rX". :)

3. Use of blrl for indirect jumps, instead of bctrl. The MPC7450
contains an 8-entry link-stack for predicting blr's well. A blrl
used for an indirect subroutine call hoses that predictor, as it looks
like a return (so pop an entry off the link stack, even though this
isn't truly a return) and a call (so push a new entry on the link
stack). When the link stack gets corrupted like this, MPC7450
invalidates the whole thing, which means one misuse of the LR can mean 7
additional branch mispredicts. (In reality, one should only use blrl if
one is using coroutines, and since no one uses those, one should never
use blrl on MPC7450 or other microprocessors that have a link-stack for
branch prediction.) (And yes, gcc had been doing this wrong for years
as well; it just didn't really hurt you until the MPC7450. I believe
it's finally fixed in 3.2.)

Do 'grep blrl /sys/arch/*pc*/*pc*/*.S' to see 6 occurrences.
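
To make the link-stack argument in item 3 concrete, here is a toy C model. It is built only from the description in this post (an 8-entry stack, blrl treated as a return followed by a call, a mismatch invalidating every entry) and not from the MPC7450 manual, so the real hardware may well differ in detail. All names and the address values are hypothetical.

```c
#include <stddef.h>

#define LINK_STACK_DEPTH 8	/* entry count taken from the post */

/* Toy predictor state; this is an assumed model, not the real design. */
struct link_stack {
    unsigned long entry[LINK_STACK_DEPTH];
    size_t count;
    unsigned mispredicts;
};

static void ls_init(struct link_stack *ls)
{
    ls->count = 0;
    ls->mispredicts = 0;
}

/* bl or bctrl: a call pushes its return address. */
static void ls_call(struct link_stack *ls, unsigned long return_addr)
{
    if (ls->count < LINK_STACK_DEPTH)
        ls->entry[ls->count++] = return_addr;
    /* Overflow is simply dropped in this toy model. */
}

/* blr: pop the predicted target and compare it with the real one. */
static void ls_return(struct link_stack *ls, unsigned long actual_target)
{
    if (ls->count == 0 || ls->entry[--ls->count] != actual_target) {
        ls->mispredicts++;
        ls->count = 0;	/* a bad prediction invalidates every entry */
    }
}

/* blrl: looks like a return (pop) and then like a call (push). */
static void ls_blrl(struct link_stack *ls, unsigned long actual_target,
    unsigned long return_addr)
{
    ls_return(ls, actual_target);
    ls_call(ls, return_addr);
}

/*
 * Seven direct calls, then one indirect call made either with blrl or
 * with bctrl, then eight returns. Returns the mispredict count.
 */
static unsigned run_nested_calls(int innermost_uses_blrl)
{
    struct link_stack ls;
    unsigned long ret;

    ls_init(&ls);
    for (ret = 1; ret <= 7; ret++)
        ls_call(&ls, ret);
    if (innermost_uses_blrl)
        ls_blrl(&ls, 100, 8);	/* 100 is the indirect call target */
    else
        ls_call(&ls, 8);	/* bctrl behaves like an ordinary call */
    for (ret = 8; ret >= 1; ret--)
        ls_return(&ls, ret);
    return ls.mispredicts;
}
```

In this model the bctrl variant unwinds with no mispredicts. The blrl variant mispredicts once at the blrl itself and then on each of the seven outer returns, which lines up with the "7 additional branch mispredicts" figure above.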

4. In a similar vein, the use of bla to s_trap in trap_subr.S to
remember what exception vector we came through will do a push that is
never popped, also leading to corruption of the link stack in the near
future. A less-cute but higher-performance method would simply be to
load a different constant for each vector into another register, but I
think that requires saving more state before going to s_trap, so maybe
there is no easy fix here.

Unfortunately, I don't have access to any NetBSD-running PowerPC
hardware to actually see if this is more than in the noise....

Brian, speaking as a guy who did way too much PowerPC assembly
programming as a grad student!

P.S. Kudos to whoever came up with the idea of using cntlzw in
cpu_switch to figure out which queues had things in them. I've rarely
seen cntlzw used for this, even though this is probably one of the
reasons PowerPC has cntlzw!
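
For readers who haven't seen the trick, here is a rough C sketch of the idea. It assumes a 32-bit word in which queue q is non-empty when bit (0x80000000 >> q) is set; the actual bit layout used by cpu_switch may differ, and the names here are hypothetical. The loop is a portable stand-in for the single cntlzw instruction.

```c
#include <stdint.h>

/*
 * Portable stand-in for cntlzw: the number of leading zero bits in a
 * 32-bit word, with 32 returned for an input of zero.
 */
static unsigned count_leading_zeros32(uint32_t x)
{
    unsigned n = 0;

    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/*
 * Assumed layout: queue q has work when bit (0x80000000 >> q) is set.
 * Returns the lowest-numbered non-empty queue, or -1 if all are empty.
 */
static int first_nonempty_queue(uint32_t queue_bits)
{
    unsigned n = count_leading_zeros32(queue_bits);

    return n == 32 ? -1 : (int)n;
}
```

The point is that one count-leading-zeros operation replaces a loop over all 32 queues: the result is directly the index of the first queue with something in it.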