NetBSD-Bugs archive


Re: toolchain/49275: compiler_rt version of __clzsi2() seems to result in universally inferior code



The following reply was made to PR toolchain/49275; it has been noted by GNATS.

From: David Laight <david%l8s.co.uk@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: 
Subject: Re: toolchain/49275: compiler_rt version of __clzsi2() seems to result in universally inferior code
Date: Sun, 7 Dec 2014 22:44:52 +0000

 On Sun, Oct 12, 2014 at 12:00:01AM +0000, dennis.c.ferguson%gmail.com@localhost wrote:
 > >Number:         49275
 > >Synopsis:       compiler_rt version of __clzsi2() seems to result in universally inferior code
 > I don't think the compiler_rt version of __clzsi2() in
 > 
 >    sys/external/bsd/compiler_rt/dist/lib/builtins/clzsi2.c
 > 
 > is very useful.  Compiling it results in a longer code sequence
 > (often significantly longer) than a more straightforward (in my view)
 > C implementation of the same function with all the machine/compiler
 > combinations available to me.
 > 
 > Three implementations of __clzsi2() are provided below, all computing
 > the same result for the same input.  The first, _crt, version is
 > copied from the compiler_rt library, the _opa version keeps the
 > use-results-of-comparisons-as-values style of the library but
 > reorganizes it to eliminate some arithmetic in the C, while the
 > _opb version recodes _opa into a more conventional (in my opinion)
 > implementation with if() statements.  These were compiled with
 > the compilers I had (-O2) and the generated instructions were
 > counted.  When branches were present in the generated code I counted
 > the instructions in the longest path through the code and appended
 > a `-' to the number to indicate there are paths with fewer (but
 > maybe not faster) instructions.  Here are the results:
 > 
 >                 crt     opa     opb
 >     i386        60      41      26-
 >     amd64       53      37      25-
 >     arm         38      25      19-
 >     ppc32       35-     29      23-
 >     ppc64       37-     35      27-
 >     coldfire    44-     42-     37-
 >     riscv32     39-     25      21-
 >     riscv64     37-     25      21-
 >     amd64cl     43      34      27-
 >     i386cl      46      37      28-
 
 Instruction count isn't a very accurate predictor of execution time
 on modern CPUs.
 Mispredicted branches can get very expensive, and because multiple
 execution units let independent instructions run in parallel, long
 dependency chains matter.
 
 This probably means that 'opb' is slower than you expect.
 Don't benchmark by running with a fixed value in a loop: the branch
 predictor will learn the pattern and hide the cost of the branches.
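
 As a rough sketch of a benchmark that avoids a fixed input, the loop
 below feeds a branchy count-leading-zeros routine values whose top set
 bit moves around unpredictably. Everything here is an assumption made
 for illustration: clz_branchy() is a hypothetical stand-in and not the
 'opb' code from the PR, and xorshift32(), sum_clz_varied() and
 time_clz_varied() are made-up helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical branchy count-leading-zeros (NOT the PR's 'opb' code).
 * x must be non-zero. */
static int clz_branchy(uint32_t x)
{
    int n = 0;
    if ((x & 0xffff0000u) == 0) { n += 16; x <<= 16; }
    if ((x & 0xff000000u) == 0) { n += 8;  x <<= 8; }
    if ((x & 0xf0000000u) == 0) { n += 4;  x <<= 4; }
    if ((x & 0xc0000000u) == 0) { n += 2;  x <<= 2; }
    if ((x & 0x80000000u) == 0) { n += 1; }
    return n;
}

/* Bit-by-bit reference used only for checking. x must be non-zero. */
static int clz_reference(uint32_t x)
{
    int n = 0;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Sanity check: benchmark numbers for a wrong function are worthless. */
static void check_clz_branchy(void)
{
    for (int i = 0; i < 32; i++) {
        assert(clz_branchy(1u << i) == clz_reference(1u << i));
        assert(clz_branchy(0xffffffffu >> i) == clz_reference(0xffffffffu >> i));
    }
}

/* Small xorshift PRNG; the state must be non-zero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Sum clz over inputs whose leading-zero count varies from call to call.
 * Shifting right by a random amount spreads the top bit over all
 * positions; "| 1" keeps the input non-zero. Returning the sum stops the
 * compiler from discarding the loop. */
static uint32_t sum_clz_varied(uint32_t seed, unsigned long iterations)
{
    uint32_t state = seed ? seed : 1u;
    uint32_t sum = 0;
    for (unsigned long i = 0; i < iterations; i++) {
        uint32_t r = xorshift32(&state);
        uint32_t x = (r >> (xorshift32(&state) & 31)) | 1u;
        sum += (uint32_t)clz_branchy(x);
    }
    return sum;
}

/* CPU time in seconds for the loop above, as reported by clock(). */
static double time_clz_varied(uint32_t seed, unsigned long iterations,
                              uint32_t *sum_out)
{
    clock_t start = clock();
    *sum_out = sum_clz_varied(seed, iterations);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

 The measured time includes the cost of generating the inputs, so it is
 only meaningful when compared against the same loop calling a different
 implementation, or against a loop that just sums the inputs.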
 
 With the very old compiler I have to hand, changing 'opa' to
 test '((x & 0xff000000) == 0)' etc. (i.e. shift the constant, not x)
 generates better code.
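
 One possible reading of that change, as a sketch only: the two
 functions below are written in a comparisons-as-values style and
 differ only in how each test is expressed. Neither is the actual 'opa'
 code from the PR; clz_shift_x(), clz_mask_const() and check_variants()
 are hypothetical names, and whether the masked form compiles to better
 code depends on the compiler and target.

```c
#include <assert.h>
#include <stdint.h>

/* Each test shifts x right and compares with zero. x must be non-zero. */
static int clz_shift_x(uint32_t x)
{
    int n = 0, t;
    t = ((x >> 16) == 0) << 4;  n += t;  x <<= t;
    t = ((x >> 24) == 0) << 3;  n += t;  x <<= t;
    t = ((x >> 28) == 0) << 2;  n += t;  x <<= t;
    t = ((x >> 30) == 0) << 1;  n += t;  x <<= t;
    n += (x >> 31) == 0;
    return n;
}

/* Each test masks x with a pre-shifted constant instead.
 * x must be non-zero. */
static int clz_mask_const(uint32_t x)
{
    int n = 0, t;
    t = ((x & 0xffff0000u) == 0) << 4;  n += t;  x <<= t;
    t = ((x & 0xff000000u) == 0) << 3;  n += t;  x <<= t;
    t = ((x & 0xf0000000u) == 0) << 2;  n += t;  x <<= t;
    t = ((x & 0xc0000000u) == 0) << 1;  n += t;  x <<= t;
    n += (x & 0x80000000u) == 0;
    return n;
}

/* Both variants must return 31 - i when bit i is the highest bit set. */
static void check_variants(void)
{
    for (int i = 0; i < 32; i++) {
        uint32_t single = 1u << i;          /* only bit i set */
        uint32_t filled = 0xffffffffu >> (31 - i); /* bits 0..i set */
        assert(clz_shift_x(single) == 31 - i);
        assert(clz_mask_const(single) == 31 - i);
        assert(clz_shift_x(filled) == 31 - i);
        assert(clz_mask_const(filled) == 31 - i);
    }
}
```

 Comparing the -O2 assembly of the two functions on the compiler in
 question is the only reliable way to see which form it handles better.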
 
 	David
 
 -- 
 David Laight: david%l8s.co.uk@localhost
 

