Subject: Re: Compiler tweaking.
To: Chris Gilbert <chris@dokein.co.uk>
From: Richard Earnshaw <rearnsha@netbsd.org>
List: port-arm
Date: 10/13/2004 11:06:45
On Mon, 2004-10-11 at 00:20, Chris Gilbert wrote:
> Hi,
> 
> I was wondering if someone could point me to where/how to do tweaks to the arm backend for gcc.  I've found that we do a few bits and pieces sub-optimally and would like to investigate (unless someone else has more time to delve).
> 
> Basically, while looking over some pmap code and the asm produced, I noted that we sometimes branch to an instruction that itself branches; this seems rather extreme, and inefficient for the pipeline.
> 
> eg: tail end of pmap_page_remove is:
> 	if (flush) {
> 		if (PV_BEEN_EXECD(flags))
> 			pmap_tlb_flushID(curpm);
> 		else
> 			pmap_tlb_flushD(curpm);
> 	}
> 	cpu_cpwait();
> }
> 
> This seems to get assembled to:
> 	.loc 1 1875 0
> 	and	r3, r7, #18
> 	cmp	r3, #18
> 	beq	.L471
> 	.loc 1 585 0
> .LBB78:
> 	ldrb	r3, [r9, #15]	@ zero_extendqisi2
> 	cmp	r3, #0
> 	bne	.L472
> 	.loc 1 1881 0
> .L432:
> 	ldmea	fp, {r4, r5, r6, r7, r8, r9, sl, fp, sp, pc}
> 	.loc 1 586 0
> .L472:
> 	ldr	r3, .L473+16
> 	mov	lr, pc
> 	ldr	pc, [r3, #44]
> 	.loc 1 587 0
> 	strb	r4, [r9, #15]
> 	b	.L432
> 	.loc 1 575 0
> .L471:
> 
> What surprised me is that .L432 is branched to, rather than the one
> line being duplicated, eg b .L432 could just as easily have been the
> ldmea, saving an extra pipeline flush.  Also the bne .L472
> used to jump across the ldmea seems odd, as I'd have expected the
> ldmea to have been conditionally executed instead of doing a branch
> around it.
> 
No, you *NEVER* want to conditionally execute a likely-false ldm on a
StrongARM chip: it takes one cycle per register in the load list even
when the condition fails.  Remember that the StrongARM's pipeline can
execute a branch instruction in about 2 cycles, so the cost of jumping
round an insn is quite low.
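
As a rough sketch of that trade-off, the C below counts the cycles a
condition-failed ldm would waste, using only the figures given above
(one cycle per listed register, about 2 cycles for a branch).  The
helper name and the BRANCH_CYCLES macro are made up for illustration,
and the model ignores everything else the pipeline does.

```c
#include <assert.h>

/* Figure taken from the message above: a branch costs about 2 cycles. */
#define BRANCH_CYCLES 2u

/* Hypothetical helper: cycles spent on an ldm whose condition fails,
 * assuming (as stated above) one cycle per register in the load list.
 * reglist is the 16-bit ARM register mask, bit n set for register rn. */
unsigned failed_ldm_cycles(unsigned reglist)
{
    unsigned cycles = 0;

    assert(reglist <= 0xFFFFu);   /* an ldm names at most r0..r15 */
    while (reglist != 0) {
        cycles += reglist & 1u;   /* one cycle per listed register */
        reglist >>= 1;
    }
    return cycles;
}
```

For the ldmea in the quoted code, {r4-r9, sl, fp, sp, pc} is the mask
0xAFF0, so failed_ldm_cycles(0xAFF0) gives 10, against a BRANCH_CYCLES
of 2.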

> To me this suggests the compiler is missing an optimisation.  The easy
> one being to look at the target of a branch and see if it's another
> branch, and if so avoid doing the double hop.
> 
> The other issue looks more complex, as it appears that .L432 may
> actually be the original chunk of return code.
> 

I've seen instances of this in the past, but haven't yet sat down to
work out why it's happening (I suspect the ultimate reason is that there
are two mechanisms for describing a return, one which can be inlined,
and the other which cannot; but the question is why it is not choosing
the former rather than the latter in some circumstances).

> If someone can point me at or make suggestions on how to tweak gcc
> backends I'd be grateful, as I suspect these kinds of optimisations
> would help performance.  Performance is one of my reasons: I'm
> attempting to do a pkgsrc-2004Q3 bulk build on a cats (2.0_RC4) box,
> and it's spending a large chunk of time in system.

Not as much as you might expect.  Given the StrongARM's branch cost,
you'd have to execute ~100 million return instructions[1] @ 200MHz to
see a second's worth of improvement -- that represents a tremendous
amount of executed code.
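
The arithmetic behind that estimate can be sketched in C as below.  The
function name is made up for illustration, and the 2-cycle saving per
return is an assumption taken from the branch cost quoted earlier in
this message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how many branches must be removed to save one
 * second, given the clock rate and the cycles saved per branch.
 * Assumes a perfect cache, so only the branch cost itself counts. */
uint64_t branches_per_second_saved(uint64_t clock_hz,
                                   uint64_t cycles_per_branch)
{
    assert(cycles_per_branch != 0);   /* avoid dividing by zero */
    return clock_hz / cycles_per_branch;
}
```

With clock_hz = 200000000 and cycles_per_branch = 2 this returns
100000000, i.e. the ~100 million figure above.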

R.

[1] Assumes a perfect cache of course!