NetBSD-Bugs archive


Re: toolchain/49275: compiler_rt version of __clzsi2() seems to result in universally inferior code



The following reply was made to PR toolchain/49275; it has been noted by GNATS.

From: Dennis Ferguson <dennis.c.ferguson%gmail.com@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: toolchain-manager%netbsd.org@localhost,
 gnats-admin%netbsd.org@localhost,
 netbsd-bugs%netbsd.org@localhost
Subject: Re: toolchain/49275: compiler_rt version of __clzsi2() seems to result in universally inferior code
Date: Tue, 9 Dec 2014 11:33:13 -0800

 On 7 Dec, 2014, at 14:50 , David Laight <david%l8s.co.uk@localhost> wrote:
 >>                crt     opa     opb
 >>    i386        60      41      26-
 >>    amd64       53      37      25-
 >>    arm         38      25      19-
 >>    ppc32       35-     29      23-
 >>    ppc64       37-     35      27-
 >>    coldfire    44-     42-     37-
 >>    riscv32     39-     25      21-
 >>    riscv64     37-     25      21-
 >>    amd64cl     43      34      27-
 >>    i386cl      46      37      28-
 > 
 > Instruction count isn't a very accurate way of determining the
 > execution time on modern cpus.
 > Mispredicted branches can get very expensive, and multiple execution
 > units mean that long dependency chains matter.
 > 
 > This probably means that 'opb' is slower than you expect.
 > Don't benchmark by running with a fixed value in a loop.
 > 
 > With the very old compiler I have to hand changing 'opa' to
 > test '((x & 0xff000000) == 0)' etc (ie shift the constant, not x)
 > generates better code.
 
 You can shift the variable or shift the mask and get the same
 boolean result, but which of the two is better depends strongly
 on the CPU and its instruction set.  I just pick one way for the C
 code and assume the compiler will be smart enough to do it the
 other way if that is cheaper for the target.  If the compiler
 isn't that smart now I'm willing to wait for a version that is.
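 In C, the two forms of the test look like this (a minimal sketch;
 the function names are mine, not from any of the patches):

```c
#include <stdint.h>

/* Equivalent tests for "the top byte of x is zero": one shifts the
 * variable, the other masks with a shifted constant.  Which form is
 * cheaper depends on the target instruction set; a good compiler
 * should convert one into the other when that helps. */
static int top_byte_zero_shift(uint32_t x)
{
    return (x >> 24) == 0;          /* shift the variable */
}

static int top_byte_zero_mask(uint32_t x)
{
    return (x & 0xff000000u) == 0;  /* shift the constant instead */
}
```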
 
 Note, though, that while everything you say is true in general
 we're not really discussing the general case but rather three
 alternative implementations of one particular function.  All of
 these are one long dependency chain (the computation of `r' is a
 sideline, but the `x' and `t' computations are pretty much
 constrained to be sequential), so multiple-issue CPUs have less
 advantage here than in more "normal" code.  In cases like this the
 relative instruction count may be a somewhat more accurate
 predictor of execution cost (though maybe still not on an x86,
 which doesn't execute its own instruction set), but that's not
 exactly what I expected the table to convey.  Additionally, adds
 of and shifts by constants (a la 'opb') can be speculatively
 executed in parallel with the if() computation and tossed if they
 aren't needed, but adds of and shifts by `t' generally have to
 wait until `t' is known, so that plus the limited parallelism
 elsewhere may mean that 'opb' is in fact often faster than you
 seem to expect.
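 For concreteness, here is a sketch of the constant-shift style
 being described.  The actual 'opb' source isn't quoted in this
 message, so take this as an illustration of the shape rather than
 the patch itself: every shift amount and addend is a compile-time
 constant, so each arm can be issued speculatively alongside the
 compare.

```c
#include <stdint.h>

/* Binary-search count-leading-zeros using only constant shifts and
 * constant adds, in the style attributed to 'opb' above.  Each if()
 * halves the remaining search range; no shift amount ever depends on
 * an earlier computed value, only the running count `r' and the
 * renormalized `x' do. */
static int clz32(uint32_t x)
{
    int r = 0;

    if (x == 0)
        return 32;
    if ((x >> 16) == 0) { r += 16; x <<= 16; }
    if ((x >> 24) == 0) { r += 8;  x <<= 8;  }
    if ((x >> 28) == 0) { r += 4;  x <<= 4;  }
    if ((x >> 30) == 0) { r += 2;  x <<= 2;  }
    if ((x >> 31) == 0) { r += 1; }
    return r;
}
```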
 
 But even that is kind of beside the point.  The issue isn't something
 subtle about code generation or machine execution; it is that the
 compiler_rt __clzsi2() implementation is crappy.  It does too much
 arithmetic to compute its result; it is too complex.  This is true
 on every machine.  The reason 'opa' and 'opb' need fewer
 instructions than the current 'crt' algorithm on all the machines
 we do know about is that, as written, the C code of 'opa' and 'opb'
 computes fewer things to find the same result, and doing less work
 requires fewer instructions.  That (plus the compiler's inability
 to recognize that the extra work done in 'crt' is useless) is what
 I expected the instruction count to convey.  Between 'opa' and
 'opb' there is less to choose, since the basic algorithm is the
 same, and I'd be happy with either, but 'opb' is just simpler to my
 eye and that is usually better.  This is C code, after all; the
 goal isn't to write code which is optimal for any machine, it is to
 write code which is clear enough about what is being computed that
 the compiler (or some future compiler) can understand the intent
 well enough to produce code which is optimal for a particular
 target.
 
 Note that execution times on particular machines are consistent with
 the conclusion that the compiler_rt code is crap.  Averaged over all
 values of the 32 bit argument, for RISC-V, which is particularly
 relevant since it has no clz instruction, the simulator says that 'opb'
 is not quite twice the speed of 'crt'.  On my arm machine 'opb' is
 twice the speed of 'crt', while the hand-written assembler __clzsi2()
 it uses now is 2.5 times faster than 'crt' (i.e. 25% faster than
 'opb'), though it produces slightly different results.  For amd64
 and i386 'opb'
 is 55% faster than 'crt' on my fancy Core i7, while on my crappy Intel
 Atom machine the same i386 'opb' binary runs 3 times faster than 'crt'
 (and more than twice as fast as 'opa') for reasons I can guess at.
 This is a fairly trivial function; the magnitude of those differences
 is indicative of just how poor the compiler_rt implementation is.  I'm
 pretty confident you won't find any machine where 'crt' is faster; it
 would take a miracle for a compiler to understand what 'crt' is doing
 well enough to eliminate the junk and produce the same machine code as
 'opb' (which is what would happen in a perfect world since they produce
 the same results for the same input).
 
 I have a library which, among other things, provides a 32 bit clz
 operation that can be made optimal for any machine which someone cares
 about enough to write the specific code for.  For most machines the
 optimal implementation is __builtin_clz() because most machines have
 an instruction to compute this and __builtin_clz() generates it.  I
 would like __builtin_clz() to become more-or-less optimal for all
 machines (rather than just most) so that this part of my own library
 could go away.  That means dealing with machines which don't have an
 instruction and instead call __clzsi2() in the runtime library.
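 Concretely, the portable wrapper I have in mind looks something
 like this (a sketch; the zero guard is needed because GCC and clang
 document __builtin_clz(0) as undefined):

```c
#include <stdint.h>

/* 32-bit count-leading-zeros via the compiler builtin.  On machines
 * with a clz instruction this compiles to that instruction; on
 * machines without one it becomes a call to __clzsi2() in the
 * runtime library, which is why the quality of that routine matters.
 * __builtin_clz(0) is undefined, so zero is handled separately. */
static int clz32_builtin(uint32_t x)
{
    return x == 0 ? 32 : __builtin_clz(x);
}
```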
 
 For RISC-V, which has no instruction and calls __clzsi2(), 'opb'
 produces code which is as good as I could write in assembly and
 performs significantly better than the compiler_rt implementation.
 I could add this as a RISC-V special, the way the arm assembly
 implementation is handled, but doing it this way would imply the
 compiler_rt implementation might have some value on some other
 machine and I don't believe that is true.  The compiler_rt code
 is crappy and replacing it would do everyone a favor.
 
 You seem reluctant to replace it.  If you don't like the
 instruction count metric (and I could add more machines to that
 table; 'crt' always loses), is there some other measurable that
 would more reliably indicate to you that the compiler_rt code is
 indeed just universally crappy?  I haven't found one that even
 hints that it isn't.
 
 Dennis Ferguson
 

