port-arm: Compiler tweaking.

Subject: Compiler tweaking.
To: None <port-arm@netbsd.org>
From: Chris Gilbert <chris@dokein.co.uk>
List: port-arm
Date: 10/11/2004 00:20:46
Hi,

I was wondering if someone could point me to where/how to tweak the arm backend for gcc.  I've found that we do a few bits and pieces sub-optimally and would like to investigate (unless someone else has more time to delve).

Basically, while looking over some pmap code and the asm produced for it, I noticed that we sometimes branch to an instruction that itself does a branch.  This seems rather wasteful, and inefficient for the pipeline.

E.g. the tail end of pmap_page_remove is:
	if (flush) {
		if (PV_BEEN_EXECD(flags))
			pmap_tlb_flushID(curpm);
		else
			pmap_tlb_flushD(curpm);
	}
	cpu_cpwait();
}

This seems to get assembled to:
	.loc 1 1875 0
	and	r3, r7, #18
	cmp	r3, #18
	beq	.L471
	.loc 1 585 0
.LBB78:
	ldrb	r3, [r9, #15]	@ zero_extendqisi2
	cmp	r3, #0
	bne	.L472
	.loc 1 1881 0
.L432:
	ldmea	fp, {r4, r5, r6, r7, r8, r9, sl, fp, sp, pc}
	.loc 1 586 0
.L472:
	ldr	r3, .L473+16
	mov	lr, pc
	ldr	pc, [r3, #44]
	.loc 1 587 0
	strb	r4, [r9, #15]
	b	.L432
	.loc 1 575 0
.L471:

What surprised me is that .L432 is branched to, rather than the one instruction being duplicated; e.g. the b .L432 could just as easily have been the ldmea itself, saving an extra pipeline flush.  The bne .L472 used to jump across the ldmea also seems odd, as I'd have expected the ldmea to be conditionally executed instead of doing a branch around it.

To me this suggests the compiler is missing an optimisation.  The easy one would be to look at the target of a branch and see if it's another branch; if so, avoid doing the double hop.
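
The branch-to-branch check described above can be illustrated with a minimal sketch in C over a made-up instruction array.  Everything here (struct insn, the op enum, thread_branches) is a hypothetical name invented for the example and bears no relation to gcc's actual internals; it only shows the shape of the transformation, assuming a return is a single instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified instruction model; not gcc's RTL. */
enum op { OP_OTHER, OP_BRANCH, OP_COND_BRANCH, OP_RETURN };

struct insn {
	enum op op;
	int target;	/* index of the branch target, used by the branch ops */
};

/*
 * Retarget branches whose target is itself an unconditional branch, and
 * turn an unconditional branch to a one-instruction return into a copy
 * of that return.  Returns the number of rewrites performed.
 */
static int
thread_branches(struct insn *code, size_t n)
{
	int changed = 0;

	for (size_t i = 0; i < n; i++) {
		if (code[i].op != OP_BRANCH && code[i].op != OP_COND_BRANCH)
			continue;

		/* Follow chains of unconditional branches; cap the hops so
		 * that a cycle of branches cannot loop forever. */
		int t = code[i].target;
		for (size_t hops = 0; hops < n; hops++) {
			assert(t >= 0 && (size_t)t < n);
			if (code[t].op != OP_BRANCH)
				break;
			t = code[t].target;
		}
		if (t != code[i].target) {
			code[i].target = t;
			changed++;
		}

		/* An unconditional branch to a lone return becomes the return. */
		if (code[i].op == OP_BRANCH && code[t].op == OP_RETURN) {
			code[i].op = OP_RETURN;
			changed++;
		}
	}
	return changed;
}
```

Applied to the listing above, the b .L432 at the end of the .L472 block would become a copy of the ldmea, while the bne .L472 would be left alone because its target is not a branch.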

The other issue looks more complex, as it appears that .L432 may actually be the original chunk of return code.

If someone can point me at, or make suggestions on, how to tweak gcc backends I'd be grateful, as I suspect these kinds of optimisations would help performance.  Performance is one of my reasons: I'm attempting to do a pkgsrc-2004Q3 bulk build on a cats (2.0_RC4) box, and it's spending a large chunk of time in system.

Thanks,
Chris