Subject: Re: Compiler tweaking.
To: Richard Earnshaw <rearnsha@netbsd.org>
From: Chris Gilbert <chris@dokein.co.uk>
List: port-arm
Date: 10/13/2004 23:23:13
On Wed, 13 Oct 2004 11:06:45 +0100
Richard Earnshaw <rearnsha@netbsd.org> wrote:

> On Mon, 2004-10-11 at 00:20, Chris Gilbert wrote:
> > What surprised me is that .L432 is branched to, rather than the one
> > line being duplicated, eg b .L432, could just have easily been the
> > ldmea, and save doing an extra pipeline flush.  Also the bne .L472
> > used to jump across the ldmea seems odd, as I'd have expected the
> > ldmea to have been conditionally executed instead of doing a branch
> > around it.
> > 
> No, you *NEVER* want to conditionally execute a likely-false ldm on a
> StrongARM chip, it takes one cycle per register in the load list even
> when the condition fails.  Remember that the StrongARM's pipeline can
> execute a branch instruction in about 2 cycles, so the cost of jumping
> round an insn is quite low.

Ouch, I didn't realise it had that effect on timing, in which case it really isn't a problem then :)
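
Just so I have the trade-off straight (the register list here is made up
for illustration), the two alternatives would be roughly:

        @ conditional return: even when the condition fails, the StrongARM
        @ still spends one cycle per register in the list (~5 cycles here)
        ldmeqea fp, {r4, r5, fp, sp, pc}

versus what gcc actually emits:

        @ branch round the return: a branch is only ~2 cycles on StrongARM,
        @ so jumping over the ldmea is the cheaper choice
        bne     .L472
        ldmea   fp, {r4, r5, fp, sp, pc}
.L472: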
 
> > To me this suggests the compiler is missing an optimisation.  The easy
> > one would be to look at the target of a branch and see if it's another
> > branch; if so, avoid doing the double hop.
> > 
> > The other issue looks more complex, as it appears that .L432 may
> > actually be the original chunk of return code.
> > 
> 
> I've seen instances of this in the past, but haven't sat down to work
> out yet why it's happening (I suspect the ultimate reason is that there
> are two mechanisms for describing a return, one which can be inlined,
> and the other which cannot; but the question is why it is not choosing
> the latter rather than the former in some circumstances).

Perhaps it's related to the expected cost of the ldm...
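
For reference, the case I was looking at has roughly this shape (paraphrased
from memory, and the register list is invented):

        ...
        b       .L432           @ branch to the shared return sequence
        ...
.L432:
        ldmea   fp, {r4, r5, fp, sp, pc}   @ the single-instruction return

where the b .L432 could just as easily have been a copy of the ldmea itself,
trading one instruction of space for one fewer taken branch.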

> > If someone can point me at, or make suggestions on, how to tweak gcc
> > backends, I'd be grateful, as I suspect these kinds of optimisations
> > would help performance.  Performance is one of my motivations: I'm
> > attempting a pkgsrc-2004Q3 bulk build on a cats (2.0_RC4) box, and
> > it's spending a large chunk of time in system.
> 
> Not as much as you might expect.  Given the StrongARM's branch cost,
> you'd have to execute ~100 million return instructions[1] @ 200MHz to
> see a second's worth of improvement -- that represents a tremendous
> amount of executed code.

Doh, yes, if I'd actually thought about it I'd have realised that.  For some reason I thought branches cost 5 cycles, but perhaps that's the XScale...
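
Working through your numbers: at roughly 2 cycles saved per return,

        2 cycles/return x 100,000,000 returns = 200,000,000 cycles
        200,000,000 cycles at 200MHz          = ~1 second

so it really would take a huge amount of executed code before it showed up.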

In fact, investigating further with a profiling kernel turned up some numbers.  During 207s of system time, the top few biggies are:

Each sample counts as 0.002 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
  7.39     15.37    15.37 10583344     0.00     0.00  lockmgr
  4.06     23.82     8.44  1613502     0.01     0.01  cache_lookup
  3.51     31.12     7.31   472628     0.02     0.14  uvm_fault
  3.23     37.85     6.72  1305413     0.01     0.01  pmap_enter
  2.81     43.69     5.85    65111     0.09     0.09  bcopy_page
  2.81     49.53     5.84   384353     0.02     0.16  data_abort_handler
  2.48     54.68     5.15   121001     0.04     0.04  bzero_page
  1.87     58.57     3.90   457261     0.01     0.26  esigcode
  1.87     62.46     3.88   463551     0.01     0.01  sa1_cache_purgeD_rng
  1.72     66.03     3.58   313576     0.01     0.01  copyout
  1.68     69.53     3.50  1498044     0.00     0.00  nfs_access
  1.63     72.92     3.39  1536459     0.00     0.00  pool_get
  1.53     76.11     3.18  9942269     0.00     0.00  acquire
  1.52     79.27     3.16   236570     0.01     0.03  pmap_remove
  1.47     82.33     3.07   294844     0.01     0.03  genfs_getpages
  1.45     85.35     3.02  2595119     0.00     0.00  memcpy
  1.43     88.32     2.97   112793     0.03     0.11  lookup
  1.37     91.18     2.85  1475243     0.00     0.02  nfs_lookup
  1.24     93.76     2.58    36380     0.07     0.10  ltsleep
  1.20     96.26     2.50  1585508     0.00     0.00  nfs_getattrcache
  1.19     98.72     2.46  4410297     0.00     0.00  vn_lock
  1.15    101.10     2.38  1035595     0.00     0.00  uvm_map_lookup_entry
  1.01    103.20     2.09   141281     0.01     0.03  uvm_map

The sheer volume of locking calls is quite amazing, but then this was a bulk pkgsrc build, and it mostly seemed to be playing with files on an NFS share while it was creating the dependency information.
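
For the record, the profile above came from roughly the following; this is
from memory, so check config(8), kgmon(8) and gprof(1) for the exact details:

        config -p GENERIC         # set up a profiling kernel build
        # (build and boot the profiling kernel)
        kgmon -b                  # start collecting samples
        # (run the workload)
        kgmon -h                  # stop profiling
        kgmon -p                  # dump the buffers to gmon.out
        gprof /netbsd gmon.out    # produce the flat profile above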

However, I think I've found one cause of the system time: it appears that the cats GENERIC kernel has DIAGNOSTIC turned on.  I remember that i386's GENERIC has it turned off these days, as it's been found to be expensive.
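
Turning it off should just be a matter of commenting the option out of the
cats kernel config, i.e. something like:

        #options        DIAGNOSTIC      # internal consistency checks

assuming it's spelled the same way there as in the i386 config.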

I guess there's not going to be a magic optimisation that helps performance.  Perhaps the pkgsrc infrastructure is partly at fault for dumping intermediate files into the pkgsrc source dir rather than using a local disk.

Thanks,
Chris