NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
toolchain/49275: compiler_rt version of __clzsi2() seems to result in universally inferior code
>Number: 49275
>Category: toolchain
>Synopsis: compiler_rt version of __clzsi2() seems to result in universally inferior code
>Confidential: no
>Severity: non-critical
>Priority: low
>Responsible: toolchain-manager
>State: open
>Class: change-request
>Submitter-Id: net
>Arrival-Date: Sun Oct 12 00:00:00 +0000 2014
>Originator: Dennis Ferguson
>Release: NetBSD 7.99.1
>Organization:
none at all
>Environment:
>Description:
I don't think the compiler_rt version of __clzsi2() in
sys/external/bsd/compiler_rt/dist/lib/builtins/clzsi2.c
is very useful. Compiling it results in a longer code sequence
(often significantly longer) than a more straight forward (in my view)
C implementation of the same function with all the machine/compiler
combinations available to me.
Three implementations of __clzsi2() are provided below, all computing
the same result for the same input. The first, _crt, version is
copied from the compiler_rt library, the _opa version keeps the
use-results-of-comparisons-as-values style of the library but
reorganizes it to eliminate some arithmetic in the C, while the
_opb version recodes _opa into a more conventional (in my opinion)
implementation with if() statements. These were compiled with
the compilers I had (-O2) and the generated instructions were
counted. When branches were present in the generated code I counted
the instructions in the longest path through the code and appended
a `-' to the number to indicate there are paths with fewer (but
maybe not faster) instructions. Here are the results:
crt opa opb
i386 60 41 26-
amd64 53 37 25-
arm 38 25 19-
ppc32 35- 29 23-
ppc64 37- 35 27-
coldfire 44- 42- 37-
riscv32 39- 25 21-
riscv64 37- 25 21-
amd64cl 43 34 27-
i386cl 46 37 28-
The compilers are all gcc except for the last two. In a perfect
world the results in a row would all be the same (minimal) value
since the C implementations all compute the same thing with very
similar algorithms, but it is clear the compiler_rt version does
not agree well with the world as it is.
Of these I think this is of practical consequence only for RISC-V,
which has no clz instruction. The 21 instructions generated from
_opb appears to me to be the minimal sequence for this instruction
set, making hand-crafted assembly for top speed unnecessary. A
reason to be interested in making clz fast is that a clz operation
can be a significant fraction of the cost of doing a (well-cached)
IP route lookup on a machine with no instruction.
>How-To-Repeat:
Compile this and look at the generated assembly:
typedef int si_int;
typedef unsigned int su_int;
si_int
__clzsi2_crt(si_int a)
{
su_int x = (su_int) a;
si_int t = ((x & 0xffff0000) == 0) << 4;
si_int r = t;
x >>= 16 - t;
t = ((x & 0xff00) == 0) << 3;
x >>= 8 - t;
r += t;
t = ((x & 0xf0) == 0) << 2;
x >>= 4 - t;
r += t;
t = ((x & 0xc) == 0) << 1;
x >>= 2 - t;
r += t;
return (r + ((2 - x) & -((x & 2) == 0)));
}
si_int
__clzsi2_opa(si_int a)
{
su_int x = (su_int) a;
si_int r = (((x >> 16) & 0xffff) == 0) << 4;
si_int t = r;
x <<= t;
t = (((x >> 24) & 0xff) == 0) << 3;
x <<= t;
r += t;
t = (((x >> 28) & 0xf) == 0) << 2;
x <<= t;
r += t;
t = (((x >> 30) & 0x3) == 0) << 1;
x <<= t;
r += t;
r += ((x & (0x1u << 31)) == 0) + (x == 0);
return (r);
}
si_int
__clzsi2_opb(si_int a)
{
su_int x = (su_int) a;
si_int r = 0;
if (((x >> 16) & 0xffff) == 0) {
r = 16;
x <<= 16;
}
if (((x >> 24) & 0xff) == 0) {
r += 8;
x <<= 8;
}
if (((x >> 28) & 0xf) == 0) {
r += 4;
x <<= 4;
}
if (((x >> 30) & 0x3) == 0) {
r += 2;
x <<= 2;
}
if ((x & (0x1 << 31)) == 0) {
r += 1;
if (x == 0) {
r += 1;
}
}
return (r);
}
>Fix:
Home |
Main Index |
Thread Index |
Old Index