NetBSD-Bugs archive
Re: toolchain/49275: compiler_rt version of __clzsi2() seems to result in universally inferior code
The following reply was made to PR toolchain/49275; it has been noted by GNATS.
From: David Laight <david%l8s.co.uk@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc:
Subject: Re: toolchain/49275: compiler_rt version of __clzsi2() seems to result in universally inferior code
Date: Sun, 7 Dec 2014 22:44:52 +0000
On Sun, Oct 12, 2014 at 12:00:01AM +0000, dennis.c.ferguson%gmail.com@localhost wrote:
> >Number: 49275
> >Synopsis: compiler_rt version of __clzsi2() seems to result in universally inferior code
> I don't think the compiler_rt version of __clzsi2() in
>
> sys/external/bsd/compiler_rt/dist/lib/builtins/clzsi2.c
>
> is very useful. Compiling it results in a longer code sequence
> (often significantly longer) than a more straightforward (in my view)
> C implementation of the same function with all the machine/compiler
> combinations available to me.
>
> Three implementations of __clzsi2() are provided below, all computing
> the same result for the same input. The first, _crt, version is
> copied from the compiler_rt library, the _opa version keeps the
> use-results-of-comparisons-as-values style of the library but
> reorganizes it to eliminate some arithmetic in the C, while the
> _opb version recodes _opa into a more conventional (in my opinion)
> implementation with if() statements. These were compiled with
> the compilers I had (-O2) and the generated instructions were
> counted. When branches were present in the generated code I counted
> the instructions in the longest path through the code and appended
> a `-' to the number to indicate there are paths with fewer (but
> maybe not faster) instructions. Here are the results:
>
>              crt   opa   opb
>   i386        60    41    26-
>   amd64       53    37    25-
>   arm         38    25    19-
>   ppc32       35-   29    23-
>   ppc64       37-   35    27-
>   coldfire    44-   42-   37-
>   riscv32     39-   25    21-
>   riscv64     37-   25    21-
>   amd64cl     43    34    27-
>   i386cl      46    37    28-
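Instruction count isn't a very accurate way of determining the
execution time on modern CPUs.
Mispredicted branches can get very expensive, and with multiple
execution units the length of the dependency chain matters more than
the number of instructions.
This probably means that 'opb' is slower than you expect.
Don't benchmark by running with a fixed value in a loop: the branches
then always go the same way, so they are all predicted correctly and
the cost of mispredictions never shows up.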
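A minimal sketch of what a benchmark with varying input might look
like. Everything here is an assumption made for illustration: the
names (clz32_ref, xorshift32, fill_inputs, time_clz), the use of a
xorshift generator, and clock() as the timer. clz32_ref is only a
stand-in for whichever implementation is being measured; it is not
one of the three versions from the PR.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Stand-in for the function under test (hypothetical, not from the PR). */
static int clz32_ref(uint32_t x)
{
    int n = 0;

    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Marsaglia's xorshift32 generator; the state must be non-zero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t s = *state;

    s ^= s << 13;
    s ^= s >> 17;
    s ^= s << 5;
    *state = s;
    return s;
}

/*
 * Fill buf with values whose leading-zero count varies: each random
 * word is shifted right by a random amount in 0..31, so the branches
 * in an if()-based clz do not always go the same way.
 */
static void fill_inputs(uint32_t *buf, size_t n, uint32_t seed)
{
    uint32_t state = seed ? seed : 1u;

    for (size_t i = 0; i < n; i++) {
        uint32_t v = xorshift32(&state);
        buf[i] = v >> (xorshift32(&state) & 31);
    }
}

/*
 * Time fn over the whole buffer. The running sum is returned through
 * *sum so the compiler cannot discard the calls.
 */
static double time_clz(int (*fn)(uint32_t), const uint32_t *buf,
                       size_t n, unsigned long *sum)
{
    unsigned long acc = 0;
    clock_t start = clock();

    for (size_t i = 0; i < n; i++)
        acc += (unsigned long)fn(buf[i]);
    *sum = acc;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling through a function pointer also keeps the function from being
inlined into the loop, which may or may not be what you want to
measure. A buffer large enough to run for a measurable time, and
several repetitions, would be needed for usable numbers.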
With the very old compiler I have to hand, changing 'opa' to
test '((x & 0xff000000) == 0)' etc. (i.e. shift the constant, not x)
generates better code.
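One possible reading of that kind of test, as a rough sketch. This is
not the 'opa' code from the PR; the name clz32_mask and the if()
structure are hypothetical, chosen only to show each comparison being
made against a mask constant rather than against a right-shifted copy
of x. Returning 32 for zero is a choice made for this sketch.

```c
#include <stdint.h>

/*
 * Count leading zeros of a 32-bit value. Every test compares x with a
 * mask constant, e.g. (x & 0xff000000) == 0 instead of (x >> 24) == 0.
 * x is moved left after a successful test so that the next, narrower
 * mask again covers the topmost bits still unexamined.
 */
static int clz32_mask(uint32_t x)
{
    int n = 0;

    if (x == 0)
        return 32;      /* choice for this sketch */
    if ((x & 0xffff0000u) == 0) { n += 16; x <<= 16; }
    if ((x & 0xff000000u) == 0) { n += 8;  x <<= 8;  }
    if ((x & 0xf0000000u) == 0) { n += 4;  x <<= 4;  }
    if ((x & 0xc0000000u) == 0) { n += 2;  x <<= 2;  }
    if ((x & 0x80000000u) == 0) { n += 1; }
    return n;
}
```

Whether this form actually produces better code depends on the
compiler and target, so it needs checking against the generated
assembly in each case.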
David
--
David Laight: david%l8s.co.uk@localhost