Port-powerpc archive


A few possible kernel assembly perf optimizations



  [Since this is PowerPC specific, I'm sending it here instead of to
tech-perform.]

  I don't know how performance-critical our assembly code is, but looking
through some of the assembly files in sys/arch/powerpc/powerpc, there
are a few things that strike me as not as efficient as they could be.

  1.  mtcr's that could really be mtcrf's.  On the MPC7450 and newer
  parts, a single-field mtcrf is not serialized at all and can execute
  in any of the three IU1 units, whereas an
  mtcr (which is really an mtcrf of all 8 fields) is execution
  serialized and must go to the lone IU2 unit -- see table A-4 and the
  subsequent NOTE in Application Note AN2203/D, "MPC7450 Microprocessor
  Family Software Optimization Guide", available from

  http://e-www.motorola.com/webapp/sps/library/documentationlist.jsp?rootNodeId=03C1TR04670871&nodeId=03C1TR046708718653&Device=MPC7450&DocTypeKey=10KscRcb&Results=25

  On longer pipelines (7450 etc.), this needless serialization hurts more
  than on the short 603 and 604 pipelines.

  As an example, in trapexit, there's an lwz to r5, followed by an
  mtcr of r5, just to test bit 17.  Bit 17 lives in CR field 4 (bits
  16-19), so doing an "mtcrf 0x08, r5" instead should execute faster on
  an MPC7450, and no slower on any other parts.
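
  For anyone wondering where the 0x08 comes from, here is a minimal
  sketch of the arithmetic in C.  The helper names are made up for
  illustration and are not from the NetBSD tree.  It assumes the usual
  PowerPC numbering, where CR bit 0 is the most significant bit and the
  FXM mask bit 0x80 selects cr0.

```c
#include <assert.h>

/*
 * CR bits are numbered 0..31 starting at the most significant bit, and
 * CR field crN covers bits 4N..4N+3.  (Hypothetical helper.)
 */
static unsigned cr_field_of_bit(unsigned bit)
{
    assert(bit < 32);
    return bit / 4;
}

/*
 * The mtcrf FXM operand has one mask bit per CR field:
 * 0x80 selects cr0, 0x40 selects cr1, ... 0x01 selects cr7.
 * Returns the single-field mask that covers the given CR bit.
 * (Hypothetical helper.)
 */
static unsigned mtcrf_mask_for_bit(unsigned bit)
{
    return 0x80u >> cr_field_of_bit(bit);
}
```

  Calling mtcrf_mask_for_bit(17) gives 0x08, the mask used above.  Any
  mask with exactly one bit set is a single-field mtcrf, which is the
  form the optimization guide describes as not serialized.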


  2.  Use of xor to generate a 0 value instead of li.  For example,
  locore_subr.S is sprinkled with xor 3,3,3 or xor 31,31,31.  Even
  though the result is always 0, this creates a true dependency between
  the xor and whichever instruction last updated the register.  In
  locore_subr.S, in cpu_switch, there is the following code (I've
  removed the MULTIPROCESSOR code to make the flow easier to follow).

        lis     3,_C_LABEL(curpcb)@ha
        lwz     31,_C_LABEL(curpcb)@l(3)
        xor     3,3,3                   /* spl0() */
        bl      _C_LABEL(lcsplx)


  The xor now depends on the lis instruction.  If an li 3,0 is used
  instead, the li can execute in parallel with the lis.

  Also, from a readability point of view, I think most PowerPC assembly
  writers find "li rX, 0" more readable than "xor rX,rX,rX".  :)

  3.  Use of blrl for indirect jumps, instead of bctrl.  The MPC7450
  contains an 8-entry link stack for predicting blr's well.  A blrl
  used for an indirect subroutine call hoses that predictor, as it looks
  like both a return (so an entry is popped off the link stack, even
  though this isn't truly a return) and a call (so a new entry is pushed
  onto the link stack).  When the link stack gets corrupted like this,
  the MPC7450 invalidates the whole thing, which means one misuse of the
  LR can mean 7 additional branch mispredicts.  (In reality, one should
  only use blrl if one is using coroutines, and since no one uses those,
  one should never use blrl on MPC7450 or other microprocessors that
  have a link stack for branch prediction.)  (And yes, gcc had been
  doing this wrong for years as well; it just didn't really hurt you
  until the MPC7450.  I believe it's finally fixed in 3.2.)

  Do 'grep blrl /sys/arch/*pc*/*pc*/*.S' to see 6 occurrences.

  4.  In a similar vein, the use of bla to s_trap in trap_subr.S to
  remember what exception vector we came through will do a push that is
  never popped, also leading to corruption of the link stack in the near
  future.  A less-cute but higher-performance method would simply be to
  load a different constant for each vector into another register, but I
  think that requires saving more state before going to s_trap, so maybe
  there is no easy fix here.


  Unfortunately, I don't have access to any NetBSD-running PowerPC
hardware to actually see whether any of this rises above the noise....

  Brian, speaking as a guy who did way too much PowerPC assembly
programming as a grad student!

P.S.  Kudos to whoever came up with the idea of using cntlzw in
cpu_switch to figure out which queues had things in them.  I've rarely
seen cntlzw used for this, even though this is probably one of the
reasons PowerPC has cntlzw!
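
For readers who haven't seen the trick, a rough C sketch of the idea
follows.  It is not the kernel's code: the function names are
hypothetical, and it assumes that queue q is marked non-empty by setting
bit (0x80000000 >> q) in a 32-bit word, with queue 0 as the highest
priority.  Under that assumption the count of leading zeros is directly
the index of the first queue with work.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Portable stand-in for the PowerPC cntlzw instruction: the number of
 * leading zero bits in a 32-bit word, or 32 if the word is zero.
 */
static unsigned count_leading_zeros32(uint32_t x)
{
    unsigned n = 0;

    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Mark queue q as non-empty (assumed convention: queue 0 is the MSB). */
static uint32_t mark_queue_nonempty(uint32_t whichqs, unsigned q)
{
    assert(q < 32);
    return whichqs | (0x80000000u >> q);
}

/* Index of the highest-priority non-empty queue, or -1 if all are empty. */
static int first_nonempty_queue(uint32_t whichqs)
{
    unsigned q = count_leading_zeros32(whichqs);

    return q == 32 ? -1 : (int)q;
}
```

On PowerPC the loop collapses to a single cntlzw, so picking the queue
costs one instruction instead of a scan over up to 32 bits.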


