Re: Bloat

To: Masao Uebayashi <uebayasi%gmail.com@localhost>, der Mouse <mouse%rodents-montreal.org@localhost>, tech-kern%netbsd.org@localhost
Subject: Re: Bloat
From: David Laight <david%l8s.co.uk@localhost>
Date: Thu, 29 Jan 2009 21:03:53 +0000

On Wed, Jan 28, 2009 at 09:43:59PM -0500, Allen Briggs wrote:
> On Thu, Jan 29, 2009 at 11:26:08AM +0900, Masao Uebayashi wrote:
> > I've considered this.  We're really need to move toward modular and
> > stable ABI.  OTOH we may need tricks to run programs faster on slow
> > computers (older computers, embedded low-power processors, etc.).
> > It'd be a good compromisation to make important APIs  a function by
> > default, while preparing a way to make them "optimized" for speed
> > (inline, less indirection, etc.).  In the "optimized" case, ABI is not
> > kept.  So users can choose either ABI or speed.
> 
> And it's not quite that simple and straightforward in some ways.
> For vax (I think) inline makes a lot of sense because there's a
> significant function call overhead.  For ARM (and others?), inlining
> can be bad in some cases because it increases the code size, which
> can increase the memory footprint of the code and the number of
> instruction cache misses.  In that case, functions (and non-unrolled
> loops) *can* actually be better.

For non-superscalar cpus and cpus without significant (or any)
instruction cache, inlining and loop unrolling are probably gains
(if you can afford the code space).
So on a vax or 68xxx inlining and unrolling are probably wins.
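
To make the unrolling trade-off concrete, here is a minimal sketch (the
function names are made up for illustration, nothing here comes from a real
kernel). Both routines compute the same sum; the unrolled one pays one
compare-and-branch per four elements at the cost of roughly four times the
loop body plus a tail loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain loop: smallest code, one compare-and-branch per element. */
uint32_t
sum_rolled(const uint32_t *v, size_t n)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum;
}

/*
 * Hand-unrolled by four: fewer branches executed, but more code,
 * plus a tail loop for the 0..3 leftover elements.
 */
uint32_t
sum_unrolled4(const uint32_t *v, size_t n)
{
	uint32_t sum = 0;
	size_t i = 0;

	for (; i + 4 <= n; i += 4)
		sum += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
	for (; i < n; i++)
		sum += v[i];
	return sum;
}
```

Which of the two is faster depends on the cpu; an optimizing compiler may
also unroll (or re-roll) either form on its own, so only measurement on the
target settles it.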

On modern cpus with large caches, the ability to execute multiple
instructions in parallel, and memory speeds that are much lower than
execution speed, things are horribly different.

The following (at least) make a difference:

- loop constructs can often be performed in parallel with the
  loop body.  With care this can mean that loop unrolling is pointless.

- instruction prefetch and decode will continue through an unconditional
  jump/call (and quite possibly return) without a pipeline stall.
  So subroutine call cost is minimal - apart from argument stacking.

- inlining and unrolling both reduce the likelihood of code being
  in the cache (either because an inlined copy cannot benefit from an
  earlier call to the same code, or simply because the additional code
  has displaced something that will be needed again).

- function calls could easily find the code already in the cache
  from a call somewhere else - particularly true for things like
  mutex code, which is called very often.
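
The shared-copy point can be sketched as follows. This is a toy lock, not
any real kernel mutex: every name is hypothetical, and
__attribute__((noinline)) is a GCC/Clang extension used here only to force
a single out-of-line copy. Both callers then run the same instructions at
the same addresses, so whichever ran last has probably left the lock code
in the instruction cache for the other.

```c
#include <stdatomic.h>

/* Toy spin lock; hypothetical, for illustration only. */
struct toy_lock {
	atomic_flag held;
};

/* Forced out of line: one copy of the code serves every caller. */
__attribute__((noinline)) void
toy_lock_enter(struct toy_lock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->held,
	    memory_order_acquire))
		continue;	/* spin until the holder releases */
}

__attribute__((noinline)) void
toy_lock_exit(struct toy_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

static struct toy_lock counter_lock = { ATOMIC_FLAG_INIT };
static int counter;

/* Two unrelated call sites that share the lock code above. */
void
counter_inc(void)
{
	toy_lock_enter(&counter_lock);
	counter++;
	toy_lock_exit(&counter_lock);
}

void
counter_add(int n)
{
	toy_lock_enter(&counter_lock);
	counter += n;
	toy_lock_exit(&counter_lock);
}

int
counter_read(void)
{
	int v;

	toy_lock_enter(&counter_lock);
	v = counter;
	toy_lock_exit(&counter_lock);
	return v;
}
```

Had toy_lock_enter() been inlined, each of the three callers would carry
its own copy of the spin loop, and none of those copies would warm the
cache for the others.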

On the other hand, inlining can turn the calling function into a 'leaf',
which typically gives the compiler many more registers to play with.

My 'gut feeling' is that it isn't worth inlining anything that
is likely to be longer than the call sequence.
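
As a rough illustration of that rule of thumb (hypothetical names, and the
instruction counts are only indicative since they vary by cpu and
compiler): the first function below is a load or two and a subtract, which
is no longer than setting up arguments and making a call, so inlining it
costs nothing. The second contains a loop and a branch, so it is left as an
ordinary function.

```c
#include <stddef.h>

struct toy_buf {
	size_t len;	/* bytes in use */
	size_t cap;	/* total capacity */
};

/* About as short as the call it replaces: a reasonable inline. */
static inline size_t
toy_buf_space(const struct toy_buf *b)
{
	return b->cap - b->len;
}

/*
 * Loop plus conditional: noticeably longer than a call sequence,
 * so it stays out of line and every caller shares one copy.
 */
size_t
toy_buf_count_byte(const unsigned char *p, size_t n, unsigned char c)
{
	size_t hits = 0;

	for (size_t i = 0; i < n; i++)
		if (p[i] == c)
			hits++;
	return hits;
}
```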

I remember a problem with VN_RELE() being a #define, it looked quite
simple - but by the time the lock and spl calls had also been inlined
it was gross - and most of the calls were in error paths...
(not NetBSD)
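
The shape of that problem can be sketched like this. It is not the real
VN_RELE(): the object, the "lock" (just a flag here) and every name are
invented for illustration. The macro looks like one line at the call site,
but it pastes the lock, decrement, test and unlock into every place it is
used, including error paths that rarely run. The function form keeps one
copy.

```c
#include <stdbool.h>

/* Hypothetical reference-counted object. */
struct toy_obj {
	bool locked;	/* stand-in for a real lock */
	int refs;
	bool dead;	/* set when the last reference goes away */
};

static inline void toy_lock(struct toy_obj *o) { o->locked = true; }
static inline void toy_unlock(struct toy_obj *o) { o->locked = false; }

/* Macro form: the whole body is expanded at every use. */
#define TOY_RELE(o) do {			\
	toy_lock(o);				\
	if (--(o)->refs == 0)			\
		(o)->dead = true;		\
	toy_unlock(o);				\
} while (0)

/* Function form: one copy, each use is just a call. */
void
toy_rele(struct toy_obj *o)
{
	toy_lock(o);
	if (--o->refs == 0)
		o->dead = true;
	toy_unlock(o);
}

/* Typical caller: the releases sit on rarely taken error paths. */
int
toy_op(struct toy_obj *o, int a, int b)
{
	if (a < 0) {
		toy_rele(o);
		return -1;
	}
	if (b < 0) {
		toy_rele(o);
		return -2;
	}
	return a + b;
}
```

With TOY_RELE() in place of toy_rele(), toy_op() would carry two full
copies of the release logic that the common path never executes.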

        David

-- 
David Laight: david%l8s.co.uk@localhost
