Subject: Re: Preparing callout(9) for a HZ-less kernel
To: None <>
From: Joerg Sonnenberger <>
List: tech-kern
Date: 10/23/2007 14:10:46
On Tue, Oct 23, 2007 at 08:02:18AM +0100, David Laight wrote:
> On Thu, Oct 18, 2007 at 09:44:35PM +0200, Joerg Sonnenberger wrote:
> > Hi all,
> > attached patch is a first big step to allow operation with dynamic
> > timers.
> Why do you think this is a good idea?
> My 'gut feeling' is that it is a bad one.
> I can see reasons for not actually taking an interrupt on every tick,
> but if you remove 'ticks' completely then you will have problems
> with busy systems taking far too many timer interupts - since you will
> rarely do the work for more than one expired callout per interrupt.

Because it also can mean that we don't have to interrupt every 10ms or
so. I don't buy the busy system due to timeouts, but that can easily be
avoided as I said. When reprogramming the timer, you just have to ensure
that you wait a minimum time to bound the interrupt rate, e.g. 1ms.

> > I don't think that the hashed time wheel works very effective
> > for the much larger possible range of values without adding a lot of
> > movement for the timers,
> Apart from the requirement to scan empty slots - which should be quite
> fast if there aren't too many of them - it should be ok provided you
> restrict the granularity a bit.
> Remember that the large majority of timers (esp in network code) are
> 'guard' timers that don't expire.

Keep in mind that it also means that you push them around in the call
wheel a lot, which is not free either.

> > so the red black tree should be as efficient.
> I wouldn't bet on it, by far the most common timer operations are
> 'start' and 'stop', values are also not at all random, and removing
> nodes repeatedly from one end (as timers expire) is also bad.
> This means that the rb-tree is likely to be continually doing rebalance
> operations.
> A 'heap' (where the 'first' node) appears at the root would be better.

I'm so sure about the heap. We can't just use the classic binary heap as
that needs a resizable array we don't have. I'm not sure if Fibonacci
heaps would be better, esp. as the expensive operation is still the
remove and only the insert is made much cheaper.

For callout_stop, I would expect it to be a relative rare operation, so
optimising that doesn't matter. callout_reset/callout_schedule happens a
lot more often, but I don't think it is performance critical either. It
happens a lot to extend the trigger period and it should be analyzed if
a lazy update of the tree is more efficient. I doubt it though.

> > Initial benchmarks from rmind@ have shown a small regression for the
> > MySQL benchmark,
> IMHO not surprising.

I expect that to be mostly related to the nanotime call and not so much
to the use of the rb tree. This is consistent with some additional tests
by rmind@.

> > but I hope the move to dynamic timers will ultimately
> > justify this.
> > 
> > With dynamic timers I mean to allow platforms which programmable
> > one-shot timers to schedule the callouts on a more precise base.
> Which isn't a good idea!
> Why not look at increasing the timer resolution to (say) 1ms, but
> using the 1-shot to avoid taking interrupts when nothing is waiting.

Read the paper Thor mentioned to get an idea why high resolution timers
are useful. Based on that I question that a call wheel of a few thousand
entries is really that efficient anymore.