Re: toolchain/49275: compiler_rt version of __clzsi2() seems to result in universally inferior code

To: gnats-bugs%NetBSD.org@localhost
Subject: Re: toolchain/49275: compiler_rt version of __clzsi2() seems to result in universally inferior code
From: Dennis Ferguson <dennis.c.ferguson%gmail.com@localhost>
Date: Tue, 9 Dec 2014 11:33:13 -0800

On 7 Dec, 2014, at 14:50 , David Laight <david%l8s.co.uk@localhost> wrote:
>>                crt     opa     opb
>>    i386        60      41      26-
>>    amd64       53      37      25-
>>    arm         38      25      19-
>>    ppc32       35-     29      23-
>>    ppc64       37-     35      27-
>>    coldfire    44-     42-     37-
>>    riscv32     39-     25      21-
>>    riscv64     37-     25      21-
>>    amd64cl     43      34      27-
>>    i386cl      46      37      28-
> 
> Instruction count isn't a very accurate way of determining the
> execution time on modern cpus.
> Mispredicted branches can get very expensive, and multiple execution
> units mean that long dependency chains matter.
> 
> This probably means that 'opb' is slower than you expect.
> Don't benchmark by running with a fixed value in a loop.
> 
> With the very old compiler I have to hand changing 'opa' to
> test '((x & 0xff000000) == 0)' etc (ie shift the constant, not x)
> generates better code.

You can shift the variable or shift the mask and get the same
boolean result, but I think the better of these depends strongly
on the CPU and its instruction set.  I just pick one way for the C
code and assume the compiler will be smart enough to do it the
other way if that is cheaper for the target.  If the compiler
isn't that smart now I"m willing to wait for a version that is.

Note, though, that while everything you say is true in general
we're not really discussing the general case but rather 3 alternative
implementations of one particular function.  All of these are one
long dependency chain (the computation of `r' is a sideline, but
`x' and `t' computations are pretty much constrained to be
sequential), so multiple issue CPUs have less advantage here
than in more "normal" code.  In cases like this the relative
instruction count may be a bit more accurate predictor of
execution cost (though maybe still not on an x86, which doesn't
execute its own instruction set), but that's not exactly what I
expected the table to convey.  Additionally, adds of and shifts
by constants (a la _opb) can be speculatively executed in parallel
with the if() computation and tossed if they aren't needed,
but adds of and shifts by `t' are generally going to have to wait
until it knows `t', so that plus limited parallelism elsewhere may
mean that 'opb' is in fact often faster than you seem to expect.

But even that is kind of beside the point.  The issue isn't something
subtle about code generation or machine execution, it is that the
the compiler_rt __clzsi2() implementation is crappy.  It does
too much arithmetic to compute its result, it is too complex.  This
is true on every machine. The reason that 'opa' and 'opb' need fewer
instructions than the current 'crt' algorithm on all the machines we
do know about is that 'opa' and 'opb' compute fewer things in the
C code as written to find the same result, and doing less work requires
fewer instructions.  That (plus the inability of the compiler to
recognize the extra work done in 'crt' is useless) is what I expected
the instruction count to convey.  Between "opa" and "opb" there is
less to choose since the basic algorithm is the same, and I'd be happy
with either, but "opb" is just simpler to my eye and that is usually
better.  This is C code after all; the goal isn't to write code which
is optimal for any machine, it is to write code which is clear enough
about what is being computed that the compiler (or some future compiler)
can understand the intent well enough to produce code which is optimal
for a particular target.

Note that execution times on particular machines are consistent with
the conclusion that the compiler_rt code is crap.  Averaged over all
values of the 32 bit argument, for RISC-V, which is particularly
relevant since it has no clz instruction, the simulator says that 'opb'
is not quite twice the speed of 'crt'.  On my arm machine 'opb' is
twice the speed of 'crt', while the hand-written assembler __clzsi2()
it uses now is 2.5 times faster than 'crt' (i.e. 25% faster than 'opb')
while producing slightly different results.  For amd64 and i386 'opb'
is 55% faster than 'crt' on my fancy Core i7, while on my crappy Intel
Atom machine the same i386 'opb' binary runs 3 times faster than 'crt'
(and more than twice as fast as 'opa') for reasons I can guess at.
This is a fairly trivial function, the magnitude of those differences
is indicative of just how poor the compiler_rt implementation is.  I'm
pretty confident you won't find any machine where 'crt' is faster; it
would take a miracle for a compiler to understand what 'crt' is doing
well enough to eliminate the junk and produce the same machine code as
'opb' (which is what would happen in a perfect world since they produce
the same results for the same input).

I have a library which, among other things, provides a 32 bit clz
operation that can be made optimal for any machine which someone cares
about enough to write the specific code for.  For most machines the
optimal implementation is __builtin_clz() because most machines have
an instruction to compute this and __builtin_clz() generates it.  I
would like __builtin_clz() to become more-or-less optimal for all
machines (rather than just most) so that this part of my own library
could go away.  That means dealing with machines which don't have an
instruction and instead call __clzsi2() in the runtime library.

For RISC-V, which has no instruction and calls __clzsi2(), 'opb'
produces code which is as good as I could write in assembly and
performs significantly better than the compiler_rt implementation.
I could add this as a RISC-V special, the way the arm assembly
implementation is handled, but doing it this way would imply the
compiler_rt implementation might have some value on some other
machine and I don't believe that is true.  The compiler_rt code
is crappy and replacing it would do everyone a favor.

You seem reluctant to replace it.  If you don't like the instruction
count metric (and I could add more machines to that table; 'crt'
always loses) is there some other measureable that would more reliably
indicate to you that the compiler_rt code is indeed just universally
crappy?  I haven't found one that even hints that it isn't.

Dennis Ferguson

References:
- Re: toolchain/49275: compiler_rt version of __clzsi2() seems to result in universally inferior code
  - From: David Laight

Prev by Date: Re: kern/49457: 'atactl sleep' loses drive
Next by Date: Re: toolchain/49275: compiler_rt version of __clzsi2() seems to result in universally inferior code
Previous by Thread: Re: toolchain/49275: compiler_rt version of __clzsi2() seems to result in universally inferior code
Next by Thread: Re: toolchain/49275: compiler_rt version of __clzsi2() seems to result in universally inferior code
Indexes:

Home | Main Index | Thread Index | Old Index