Subject: Re: Efficient callout handling
To: Charles M. Hannum <mycroft@mit.edu>
From: Chris G Demetriou <Chris_G_Demetriou@ux2.sp.cs.cmu.edu>
List: tech-kern
Date: 07/17/1996 21:34:15
> It's been brought to my attention that more efficient callout handling
> would be beneficial, not only for the current uses of callouts, but
> even more so in cases where it might be preferrable to have a *huge*
> number of callouts at once.

Not to stand in the way of progress, but...

	Is this really progress?

Rule of thumb number two of optimization is that you should measure
the system and find the problems before you optimize it.  If
callout handling is not a significant drain on the system, as i would
naively guess it is not, then what's the big deal?  (Rule of thumb
number one is "don't."  8-)

I'd like to know for sure that it is, and in what situations it is,
before it's optimized.


So, i guess my first objection is that it's not been adequately shown
that this needs optimization.  I'd like a profile of a heavily used
network server (or whatever) where timeout handling has been shown to
have a significant cost, or at least _some_ indication that it has
(or, with new uses in the future, will have) significant cost.

NetBSD has enough 'hard' problems left that (unless you're really,
really bored 8-) hacking a complex solution to a code path which
doesn't actually eat large amounts of resource(s) is a waste of time.


Now that the speech is done... 8-)


> The idea is to use a sort of radix tree, but with a twist.  You have
> HZ lists of callouts to execute within the current second, (say) 60
> lists of callouts to execute within the next 60 seconds, 60 lists
> divided by minute, 24 lists divided by hour, N lists divided by day.
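To have something concrete to argue about, here's a minimal sketch of
what i take the proposed hierarchy to be.  The level sizes follow the
quote; N == 7 and all of the names are my own guesses, not anything
from the proposal:

```c
#include <assert.h>

#define WHEEL_HZ   1024  /* tick buckets for the current second (Alpha) */
#define WHEEL_SEC  60    /* per-second buckets for the next minute */
#define WHEEL_MIN  60    /* per-minute buckets for the next hour */
#define WHEEL_HOUR 24    /* per-hour buckets for the next day */
#define WHEEL_DAY  7     /* the "N" per-day buckets; 7 is arbitrary */

struct callout {
    struct callout *c_next;
    long            c_time;             /* absolute expiry time, in ticks */
    void          (*c_func)(void *);
    void           *c_arg;
};

/* One list head per bucket.  With 8-byte pointers, the tick level
 * alone is 1024 * 8 bytes == 8k. */
struct callout *wheel_tick[WHEEL_HZ];
struct callout *wheel_sec[WHEEL_SEC];
struct callout *wheel_min[WHEEL_MIN];
struct callout *wheel_hour[WHEEL_HOUR];
struct callout *wheel_day[WHEEL_DAY];
```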

I'm not so sure that I like this...

On Alpha systems, the interval timer is supposed to be set up so that
it causes at least 1000 interrupts per second.  NetBSD/Alpha (and
Digital UNIX) uses 1024.

Considering 1024 buckets and one pointer per bucket that's 8k.  2
pointers each is 16k.  I'm not convinced that it's worth paying 8k
(regardless of memory size, etc; people _have_ used NetBSD/Alpha on
low memory -- 16M -- systems) for the first second of timeout tables.


I'd like to see some numbers:

	(1) what's the actual distribution of timeout times?
	    (i.e. how long in the future until they go off?)

	    It may make sense to weight the low end even more heavily,
	    e.g. having sets of buckets for _several_ seconds, not
	    just the first.  It also may make sense to allocate
	    buckets differently (e.g. use logs with base 2 or 16, rather
	    than 60, when calculating the 'depth' that a request belongs
	    at).  But until you have the numbers, you can't tell.  8-)

	(2) What's the average queue depth now, and what would it
	    be under this implementation?

Right now, you're talking about:

	1024 + 60 + 60 + 24 + N

queues on the Alpha.  that's 1168 + N.

Under the present system, the number of possible callouts (and i've
never run out, on any of my machines) is maxproc + 16.  On most of my
Alphas, that ends up being 164.  (it's relatively hard to bump up
maxproc because of the way the pmap code is written.)  The highest
maxproc i've seen on any NetBSD system (seen on ftp.netbsd.org, with
maxusers == 64) is 1044, so ncallout there would be 1060.

On ftp.netbsd.org (an x86 with HZ == 100), if you were using all of
the callouts, that'd be an average queue depth of 1060 / (100 + 60 +
60 + 24 + N) ~= 4 with the new implementation.  If that were an Alpha
configured with the same maxusers (since the maxproc and ncallout
calculation is machine-independent), that'd be an average queue depth
of less than 1 (which is wasteful).
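(Spelling out that arithmetic, with N taken as 0 for simplicity:)

```c
#include <assert.h>

/* average queue depth, given ncallout callouts spread evenly over
 * nqueues queues (integer truncation is fine for a rough figure) */
int
avg_depth(int ncallout, int nqueues)
{
    return ncallout / nqueues;
}
```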


It may make sense to allow the "tick buckets" to actually be "n-tick
buckets," but then you'd need to keep them sorted, and your average
"real" insertion time would go up.


> When a particular point rolls over, you promote and resort the current
> list at the appropriate levels up 1 slot.

insert everything at the front of the 'minutes' list, and then do a
(linear time) radix sort to distribute over ticks?
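i.e., something like the following sketch, assuming each callout
carries its absolute expiry time (the names and details are mine, not
from the proposal):

```c
#include <assert.h>
#include <stddef.h>

#define WHEEL_HZ 1024   /* tick buckets, as on the Alpha */

struct callout {
    struct callout *c_next;
    long            c_time;   /* absolute expiry time, in ticks */
};

static struct callout *wheel_tick[WHEEL_HZ];

/* Redistribute one per-second bucket over the tick buckets: walk the
 * list once and prepend each entry to the slot its expiry falls in.
 * Each callout is moved exactly once, so the pass is linear in the
 * length of the list. */
static void
cascade_second(struct callout **secp)
{
    struct callout *c, *next;

    for (c = *secp; c != NULL; c = next) {
        next = c->c_next;
        int slot = (int)(c->c_time % WHEEL_HZ);
        c->c_next = wheel_tick[slot];
        wheel_tick[slot] = c;
    }
    *secp = NULL;
}
```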


It seems that you could end up having to do a lot of work at regular
intervals, i.e. once a second you'd take a big hit, once a minute
you'd take a big hit (plus a big hit for the "end of second"), once an
hour ...  If you end up having to collapse multiple 'tick' buckets
into 'n-tick' buckets, the work gets even greater, since you have to
sort them...


> The overall time used for any given callout (whether deleted or
> executed) is thus O(log T), where T is the distance into the future.
> Since T is (especially when you include things like TCP timers) almost
> always <= 60 seconds, this means that the constant factor is actually
> fairly small.

Don't forget the constant multipliers!

that O(log T) potentially has a very high constant multiplier: the
cost of picking the bucket, which i call Cp below.  For many values of
hz that's the cost of a remainder op; on many machines which have
remainder in hardware it's relatively slow, and on machines which
don't it's very slow.  Arguably, even for non-power-of-two values of hz
you can hack a faster solution, but it's still _relatively_ slow
compared to the cost of a simple comparison.
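(For what it's worth: when hz is a power of two, as with 1024 on the
Alpha, the remainder collapses to a single mask; with hz == 100 you're
stuck with a real divide.  A trivial illustration, not from any actual
kernel:)

```c
#include <assert.h>

#define HZ_ALPHA 1024   /* power of two: remainder is a mask */
#define HZ_I386  100    /* not a power of two: needs a real divide */

int
tick_slot_alpha(long t)
{
    return (int)(t & (HZ_ALPHA - 1));   /* one AND instruction */
}

int
tick_slot_i386(long t)
{
    return (int)(t % HZ_I386);          /* hardware divide, or a libcall */
}
```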

If you allow 'tick' buckets to be collapsed into 'n-tick' buckets, the
cost is higher.


A balanced binary tree would be what?  Something like:

	insert:		   1 insertion * (O(Log2(N)) * Cc).  (Cc ==
			   cost of comparison.)

			   Best case: 1 * Cc
			   Average case: unknown
			   Worst case: Log2(N) * Cc

	extraction (of first): Log2(N) * Cc ?

(I say average case unknown because w/o knowing the actual
distribution of timeouts it's impossible to figure out what
the average depth of an insertion is...)
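(For comparison, the sort of structure i have in mind: a binary
min-heap keyed on expiry time, which gives exactly the Log2(N)
comparison bounds above, all Cc-type work, with no bucket math at all.
A minimal sketch, not the existing callout code; a real version would
carry the function and argument alongside each key:)

```c
#include <assert.h>

#define NCALLOUT 1060   /* e.g. ftp.netbsd.org's ncallout */

static long heap[NCALLOUT];   /* expiry times, in ticks */
static int  nheap;

void
heap_insert(long t)
{
    int i = nheap++;

    /* sift up: at most Log2(N) comparisons */
    while (i > 0 && heap[(i - 1) / 2] > t) {
        heap[i] = heap[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    heap[i] = t;
}

long
heap_extract_min(void)
{
    long min = heap[0], last = heap[--nheap];
    int i = 0, child;

    /* sift down: again at most Log2(N) comparisons */
    while ((child = 2 * i + 1) < nheap) {
        if (child + 1 < nheap && heap[child + 1] < heap[child])
            child++;
        if (last <= heap[child])
            break;
        heap[i] = heap[child];
        i = child;
    }
    heap[i] = last;
    return min;
}
```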


Given that Cp can potentially be very large compared to Cc
(i.e. tens, hundreds or thousands of times, depending on
architecture, HZ, and how this code is actually written), unless we
know that the number of callouts is Large it could very well be that a
balanced binary tree or similar structure is better.


In other words, i'm not yet convinced that this solution is:

	(1) necessary.  (i.e. i've not been shown that there is
	    a real problem here.)

	(2) no more wasteful of memory than it needs to be.

	(3) 'nice' to the system.  It causes regular, potentially
	    large bursts of CPU load completely asynchronous with
	    callout insertion/processing.  At the very least that
	    makes it hard to charge properly for that CPU usage.
	    At worst, it interferes with other system operation.
	    Wouldn't resorting have to happen with access to the
	    buckets disabled, i.e. at splhigh?

	(4) actually more efficient than a simpler system which is
	    'nicer' and less creative.  8-)


Some numbers showing real usage, both in terms of actual
timeout-handling cost and in terms of the distribution and
use of timeouts, would go a long way toward showing whether or not
this is necessary at all, and, if necessary, what types of load
it should be optimized for.



chris