Subject: Re: performance impact of branch prediction?
To: None <tech-kern@NetBSD.org>
From: Ignatios Souvatzis <ignatios@cs.uni-bonn.de>
List: tech-kern
Date: 04/21/2006 17:12:14
On Fri, Apr 21, 2006 at 03:57:55PM +0200, Hubert Feyrer wrote:
> 
> Branch prediction is used in many parts throughout the kernel, e.g.
> 
> subr_pool.c:    if (__predict_false((pc->pc_pool->pr_flags & PR_WANTED) != 0)) {
> subr_pool.c-            goto destruct;
> subr_pool.c-    }
> 
> Does anyone know the performance impact of this prediction?
> Were any kind of measurements done? What were the results?

I don't know of any benchmarks. Actually, I do know of one, but it's not a
typical application.

In theory, a CPU can have conditional branches that are faster when not
taken than when taken. The other way round is also possible, especially
on deeply pipelined modern CPUs.

Some do dynamic branch prediction based on history (Motorola 68060).

Some can have either dynamic or compiler-hinted prediction (PowerPC family).

For the latter, the compiler has to tell the assembler which sort of
branch to generate, in case it's executed on a CPU that doesn't do 
dynamic prediction. AFAIK, the constructs you mention might hint to the
compiler what to do.
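
As a rough sketch of what such a hint looks like at the C level: with
GCC-compatible compilers, macros of this kind are commonly built on
__builtin_expect(). The names my_predict_true, my_predict_false and
sum_or_error below are made up for illustration; the real definitions
live in the system headers and may differ.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's hint macros. */
#if defined(__GNUC__)
#define my_predict_true(exp)	__builtin_expect((exp) != 0, 1)
#define my_predict_false(exp)	__builtin_expect((exp) != 0, 0)
#else
/* No hint available: fall back to the plain condition. */
#define my_predict_true(exp)	((exp) != 0)
#define my_predict_false(exp)	((exp) != 0)
#endif

/*
 * Sum an array, treating a NULL pointer as the rare error case.
 * The hint may change code layout or static branch hints, but the
 * result is the same with or without it.
 */
long
sum_or_error(const int *v, size_t n)
{
	long sum = 0;
	size_t i;

	if (my_predict_false(v == NULL))
		return -1;	/* path the compiler is told is unlikely */

	for (i = 0; i < n; i++)
		sum += v[i];
	return sum;
}
```

Whether this buys anything depends on the compiler and the target CPU;
on a CPU with good dynamic prediction the hint may make no measurable
difference.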

As for the benchmark: I wrote (well, mostly copied from hp300) the 
kernel delay loop, and did some timing tests to safely guess the initial
timing parameter until the kernel calibrates itself. It turns out that for

1:	subl	%d0,%d1
	bcc	1b

the 68060 needs only one clock per loop iteration - that is,
the branch executes in zero time. Just what Motorola's mark^Wtechnical
documentation promised.
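
For readers who don't speak m68k assembler, the loop above can be
modelled in C roughly as follows. This is only a sketch: the function
names are invented here, and the cycles-per-iteration figure and the
clock frequency are passed in as parameters rather than assumed for any
particular CPU.

```c
#include <stdint.h>

/*
 * Model of:	1: subl %d0,%d1 ; bcc 1b
 * Subtract step from count until the subtraction borrows (carry set)
 * and return how many times the loop body ran.
 * step must be non-zero, otherwise the loop never terminates.
 */
uint64_t
delay_loop_iterations(uint32_t count, uint32_t step)
{
	uint64_t iters = 0;
	int borrow;

	do {
		borrow = count < step;	/* carry flag after the subl */
		count -= step;		/* wraps around on borrow */
		iters++;
	} while (!borrow);		/* bcc: loop while carry is clear */

	return iters;
}

/* Estimated run time in nanoseconds for a given clock in MHz. */
uint64_t
delay_loop_ns(uint64_t iters, uint32_t cycles_per_iter, uint32_t clock_mhz)
{
	return iters * cycles_per_iter * 1000u / clock_mhz;
}
```

With count 10 and step 3 the body runs four times. At one cycle per
iteration and, say, a 50 MHz clock (an example figure, not a measured
one), 50 iterations come out at 1000 ns.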

As long as there is at least one intervening instruction between
branches, I expect this to happen also on PowerPC with dynamically
predicted branches in tight loops, or with statically (and correctly)
predicted branches; it should also happen on other pipelined CPUs with
similar branch prediction engines.

	-is