toolchain/49275: compiler_rt version of __clzsi2() seems to result in universally inferior code

To: toolchain-manager%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: toolchain/49275: compiler_rt version of __clzsi2() seems to result in universally inferior code
From: dennis.c.ferguson%gmail.com@localhost
Date: Sun, 12 Oct 2014 00:00:01 +0000 (UTC)

>Number:         49275
>Category:       toolchain
>Synopsis:       compiler_rt version of __clzsi2() seems to result in universally inferior code
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    toolchain-manager
>State:          open
>Class:          change-request
>Submitter-Id:   net
>Arrival-Date:   Sun Oct 12 00:00:00 +0000 2014
>Originator:     Dennis Ferguson
>Release:        NetBSD 7.99.1
>Organization:
none at all
>Environment:
>Description:
I don't think the compiler_rt version of __clzsi2() in

   sys/external/bsd/compiler_rt/dist/lib/builtins/clzsi2.c

is very useful.  Compiling it results in a longer code sequence
(often significantly longer) than a more straight forward (in my view)
C implementation of the same function with all the machine/compiler
combinations available to me.

Three implementations of __clzsi2() are provided below, all computing
the same result for the same input.  The first, _crt, version is
copied from the compiler_rt library, the _opa version keeps the
use-results-of-comparisons-as-values style of the library but
reorganizes it to eliminate some arithmetic in the C, while the
_opb version recodes _opa into a more conventional (in my opinion)
implementation with if() statements.  These were compiled with
the compilers I had (-O2) and the generated instructions were
counted.  When branches were present in the generated code I counted
the instructions in the longest path through the code and appended
a `-' to the number to indicate there are paths with fewer (but
maybe not faster) instructions.  Here are the results:

                crt     opa     opb
    i386        60      41      26-
    amd64       53      37      25-
    arm         38      25      19-
    ppc32       35-     29      23-
    ppc64       37-     35      27-
    coldfire    44-     42-     37-
    riscv32     39-     25      21-
    riscv64     37-     25      21-
    amd64cl     43      34      27-
    i386cl      46      37      28-

The compilers are all gcc except for the last two.  In a perfect
world the results in a row would all be the same (minimal) value
since the C implementations all compute the same thing with very
similar algorithms, but it is clear the compiler_rt version does
not agree well with the world as it is.

Of these I think this is of practical consequence only for RISC-V,
which has no clz instruction.  The 21 instructions generated from
_opb appears to me to be the minimal sequence for this instruction
set, making hand-crafted assembly for top speed unnecessary.  A
reason to be interested in making clz fast is that a clz operation
can be a significant fraction of the cost of doing a (well-cached)
IP route lookup on a machine with no instruction.
>How-To-Repeat:
Compile this and look at the generated assembly:

typedef int si_int;
typedef unsigned int su_int;

si_int
__clzsi2_crt(si_int a)
{
        su_int x = (su_int) a;
        si_int t = ((x & 0xffff0000) == 0) << 4;
        si_int r = t;

        x >>= 16 - t;
        t = ((x & 0xff00) == 0) << 3;
        x >>= 8 - t;
        r += t;
        t = ((x & 0xf0) == 0) << 2;
        x >>= 4 - t;
        r += t;
        t = ((x & 0xc) == 0) << 1;
        x >>= 2 - t;
        r += t;
        return (r + ((2 - x) & -((x & 2) == 0)));
}

si_int
__clzsi2_opa(si_int a)
{
        su_int x = (su_int) a;
        si_int r = (((x >> 16) & 0xffff) == 0) << 4;
        si_int t = r;

        x <<= t;
        t = (((x >> 24) & 0xff) == 0) << 3;
        x <<= t;
        r += t;
        t = (((x >> 28) & 0xf) == 0) << 2;
        x <<= t;
        r += t;
        t = (((x >> 30) & 0x3) == 0) << 1;
        x <<= t;
        r += t;
        r += ((x & (0x1u << 31)) == 0) + (x == 0);
        return (r);
}

si_int
__clzsi2_opb(si_int a)
{
        su_int x = (su_int) a;
        si_int r = 0;

        if (((x >> 16) & 0xffff) == 0) {
                r = 16;
                x <<= 16;
        }
        if (((x >> 24) & 0xff) == 0) {
                r += 8;
                x <<= 8;
        }
        if (((x >> 28) & 0xf) == 0) {
                r += 4;
                x <<= 4;
        }
        if (((x >> 30) & 0x3) == 0) {
                r += 2;
                x <<= 2;
        }
        if ((x & (0x1 << 31)) == 0) {
                r += 1;
                if (x == 0) {
                        r += 1;
                }
        }
        return (r);
}

>Fix:

Prev by Date: NetBSD Nightly Trouble Ticket Report
Next by Date: Re: bin/48874
Previous by Thread: Re: bin/48140 (dhcpcd fails to get lease in -cur rent)
Next by Thread: Re: toolchain/49275: compiler_rt version of __clzsi2() seems to result in universally inferior code
Indexes:

Home | Main Index | Thread Index | Old Index