Subject: Re: Compiler tweaking.
To: Richard Earnshaw <rearnsha@netbsd.org>
From: Chris Gilbert <chris@dokein.co.uk>
List: port-arm
Date: 10/13/2004 23:23:13
On Wed, 13 Oct 2004 11:06:45 +0100
Richard Earnshaw <rearnsha@netbsd.org> wrote:
> On Mon, 2004-10-11 at 00:20, Chris Gilbert wrote:
> > What surprised me is that .L432 is branched to, rather than the one
> > line being duplicated; e.g. the b .L432 could just as easily have been the
> > ldmea, saving an extra pipeline flush. Also the bne .L472
> > used to jump across the ldmea seems odd, as I'd have expected the
> > ldmea to have been conditionally executed instead of doing a branch
> > around it
> >
> No, you *NEVER* want to conditionally execute a likely-false ldm on a
> StrongARM chip; it takes one cycle per register in the load list even
> when the condition fails. Remember that the StrongARM's pipeline can
> execute a branch instruction in about 2 cycles, so the cost of jumping
> round an insn is quite low.
Ouch, I didn't realise it had that effect on timing, in which case it really isn't a problem then :)
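Just so I have the picture straight, the two shapes being compared are roughly
these (register list, condition and labels invented for illustration; the cycle
notes are just your description of the StrongARM, not measurements):

        @ what gcc emits: branch round the (assumed rarely taken) return
        cmp     r0, #0
        bne     .L472                    @ taken branch: ~2 cycles
        ldmea   fp, {r4, r5, fp, sp, pc}
.L472:
        @ ... rest of the function

        @ what I'd expected: a predicated return
        cmp     r0, #0
        ldmeqea fp, {r4, r5, fp, sp, pc} @ ~1 cycle per register,
                                         @ even when EQ fails
        @ ... rest of the function

With a five-register list the predicated version therefore loses every time the
condition fails, which is exactly the common case for a return like this.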
> > To me this suggests the compiler is missing an optimisation. The easy
> > one would be to look at the target of a branch and see if it's another
> > branch; if so, avoid doing the double hop.
> >
> > The other issue looks more complex, as it appears that .L432 may
> > actually be the original chunk of return code.
> >
>
> I've seen instances of this in the past, but haven't sat down to work
> out yet why it's happening (I suspect the ultimate reason is that there
> are two mechanisms for describing a return, one which can be inlined,
> and the other which cannot; but the question is why it is not choosing
> the latter rather than the former in some circumstances).
Perhaps it's related to the expected cost of the ldm...
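For reference, the rewrite I was hoping for in the b .L432 case is only
replacing the branch with the single-instruction return it points at
(register list invented for illustration):

        b       .L432           @ where .L432 is: ldmea fp, {r4, r5, fp, sp, pc}

becoming

        ldmea   fp, {r4, r5, fp, sp, pc}

i.e. the same size at that site, but one branch (and pipeline flush) fewer on
that path.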
> > If someone can point me at or make suggestions on how to tweak gcc
> > backends I'd be grateful, as I suspect these kind of optimisations
> > would help performance. Performance being one of my reasons, I'm
> > attempting to do a pkgsrc-2004Q3 bulk build on a cats (2.0_RC4) box,
> > and it's spending a large chunk of time in system.
>
> Not as much as you might expect. Given the StrongARM's branch cost,
> you'd have to execute ~100 million return instructions[1] @ 200MHz to
> see a second's worth of improvement -- that represents a tremendous
> amount of executed code.
Doh, yes, if I'd actually thought about it I'd have realised that. For some reason I thought branches cost 5 cycles, but perhaps that's the XScale...
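(Working that through: at roughly 2 cycles saved per avoided branch, 100,000,000 returns x 2 cycles = 200,000,000 cycles, which at 200MHz is indeed about a second.)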
In fact, investigating further with a profiling kernel gives some numbers. During 207s of system time, the top few biggies are:
Each sample counts as 0.002 seconds.
  %   cumulative   self              self     total
 time    seconds  seconds    calls  ms/call  ms/call  name
 7.39      15.37    15.37 10583344     0.00     0.00  lockmgr
 4.06      23.82     8.44  1613502     0.01     0.01  cache_lookup
 3.51      31.12     7.31   472628     0.02     0.14  uvm_fault
 3.23      37.85     6.72  1305413     0.01     0.01  pmap_enter
 2.81      43.69     5.85    65111     0.09     0.09  bcopy_page
 2.81      49.53     5.84   384353     0.02     0.16  data_abort_handler
 2.48      54.68     5.15   121001     0.04     0.04  bzero_page
 1.87      58.57     3.90   457261     0.01     0.26  esigcode
 1.87      62.46     3.88   463551     0.01     0.01  sa1_cache_purgeD_rng
 1.72      66.03     3.58   313576     0.01     0.01  copyout
 1.68      69.53     3.50  1498044     0.00     0.00  nfs_access
 1.63      72.92     3.39  1536459     0.00     0.00  pool_get
 1.53      76.11     3.18  9942269     0.00     0.00  acquire
 1.52      79.27     3.16   236570     0.01     0.03  pmap_remove
 1.47      82.33     3.07   294844     0.01     0.03  genfs_getpages
 1.45      85.35     3.02  2595119     0.00     0.00  memcpy
 1.43      88.32     2.97   112793     0.03     0.11  lookup
 1.37      91.18     2.85  1475243     0.00     0.02  nfs_lookup
 1.24      93.76     2.58    36380     0.07     0.10  ltsleep
 1.20      96.26     2.50  1585508     0.00     0.00  nfs_getattrcache
 1.19      98.72     2.46  4410297     0.00     0.00  vn_lock
 1.15     101.10     2.38  1035595     0.00     0.00  uvm_map_lookup_entry
 1.01     103.20     2.09   141281     0.01     0.03  uvm_map
The sheer volume of locking calls is quite amazing, but then this was doing a bulk pkgsrc build, and it mostly seemed to be playing with files on an NFS share as it created the dependency information.
However, I think I've found one cause of the system time: it appears that the cats GENERIC kernel has DIAGNOSTIC turned on. I remember that i386's GENERIC has it turned off these days, as it's been found to be expensive.
I guess there's not going to be a magic optimisation that would help with performance. Perhaps the pkgsrc stuff is at fault for dumping intermediate files into the pkgsrc source dir, rather than using a local disk.
Thanks,
Chris