Subject: Re: C runqueue
To: Gregory McGarry <g.mcgarry@ieee.org>
From: Charles M. Hannum <abuse@spamalicious.com>
List: port-i386
Date: 10/23/2002 16:10:19

In addition to David's comments (specifically, the Pentium 4 issue
with the ROL instruction), I note that if you're going strictly by
cycle count, it would be even faster to put the mask values in a table
and eliminate the ROL, doing just MOV REG,MEM/AND MEM,REG instead.

You could even eliminate the scaled index (the shift count) on the
MOV REG,MEM (making it faster on the P4), and eliminate a SHR REG,IMM
at the same time (making it yet faster on the P4), by changing the
`shrl $2,%eax' to an `andl $-4,%eax', so that %eax already holds the
byte offset into a table of 32-bit masks.
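
To illustrate the table lookup in C rather than assembler, here is a
rough sketch only. The names (whichqs, clear_mask, load_mask,
clear_queue_bit) are hypothetical, and the layout (32 queues, byte-sized
priority 0..127, queue = priority >> 2, 4-byte table entries) is an
assumption rather than something taken from the real source:

```c
#include <stdint.h>
#include <string.h>

/* Assumed layout: 32 run queues, priority 0..127, queue = priority >> 2. */
#define NQS 32

static uint32_t whichqs;          /* hypothetical stand-in for the queue bitmask */
static uint32_t clear_mask[NQS];  /* clear_mask[q] == ~(1u << q) */

static void init_clear_mask(void)
{
    for (unsigned q = 0; q < NQS; q++)
        clear_mask[q] = ~((uint32_t)1 << q);
}

/*
 * Load a mask using (priority & ~3) directly as a byte offset.
 * With 4-byte entries, (priority >> 2) * 4 == (priority & ~3), so
 * neither a shift nor a scaled index is needed.
 */
static uint32_t load_mask(const uint32_t *table, unsigned priority)
{
    uint32_t m;
    memcpy(&m, (const unsigned char *)table + (priority & ~3u), sizeof m);
    return m;
}

/* Table-based clear of bit (priority >> 2): one load, then one AND. */
static void clear_queue_bit(unsigned priority)
{
    whichqs &= load_mask(clear_mask, priority);
}
```

After init_clear_mask(), calling clear_queue_bit(13) clears bit 3 of
whichqs, the same bit a rotate-and-AND sequence would clear.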

Going a step further, you can change the AND for the BTR expansion to
a SUB (because we *know* that the bit is set), and then use the same
table for both the DIAGNOSTIC test in remrunqueue() and expanding the
BTS in setrunqueue() to MOV REG,MEM/OR MEM,REG.
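
Again as a C sketch only, with hypothetical names (whichqs, bit_mask,
mark_queue, unmark_queue) and an assumed 32-queue layout, this is the
idea of one table of single-bit masks serving the test, the set and the
clear:

```c
#include <stdint.h>

#define NQS 32  /* assumed number of run queues */

static uint32_t whichqs;        /* hypothetical stand-in for the queue bitmask */
static uint32_t bit_mask[NQS];  /* bit_mask[q] == 1u << q */

static void init_bit_mask(void)
{
    for (unsigned q = 0; q < NQS; q++)
        bit_mask[q] = (uint32_t)1 << q;
}

/* Stand-in for the BTS expansion: load the mask, then OR it in. */
static void mark_queue(unsigned q)
{
    whichqs |= bit_mask[q];
}

/*
 * Stand-in for the BTR expansion.  The check plays the role of the
 * DIAGNOSTIC test; once the bit is known to be set, subtracting the
 * mask clears exactly that bit, with no borrow into other bits.
 * Returns -1 if the bit was not set, 0 otherwise.
 */
static int unmark_queue(unsigned q)
{
    uint32_t m = bit_mask[q];

    if ((whichqs & m) == 0)
        return -1;
    whichqs -= m;
    return 0;
}
```

The subtraction is only safe because of the preceding test (or an
equivalent guarantee); on a clear bit it would borrow and corrupt the
higher bits.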

Taken together, these changes make the assembler versions substantially
faster than the C versions on all x86 processors.