Subject: Re: Proposed changes to cc_microset().
To: tech-kern@NetBSD.org, Simon Burge <simonb@wasabisystems.com>
From: Frederick Bruckman <fredb@immanent.net>
List: tech-kern
Date: 07/06/2004 00:24:44
In article <20040705233831.8AB4323413@thoreau.thistledown.com.au>,
	simonb@wasabisystems.com (Simon Burge) writes:
> Frederick Bruckman wrote:
> 
>> So, the point of the patch is to call cc_microset() more frequently,
>> to update the base time for cc_microtime(), while actually doing the
>> frequency calibration less frequently (1/10 second and 10 seconds,
>> respectively).
>> 
>> I chose 1/10 second for the shorter interval because all the ports'
>> "Hz" values are divisible by 10.
> 
> I haven't looked at the math closely enough yet, but will a port with
> hz not divisible by 10 still work ok?  There are ports that use hz =
> 1024, 512, 256, 64 and 25, and some ports let you define your own
> values.  In particular, all the ports that use the "mcclock" device
> use 256 Hz as a default.

Oh, I didn't realize that we had ports with all those possibilities. It
should still work, though. The only part that assumes 1/10 second is the
sanity check; that may need some tweaking.
 
> It looks like if you want to keep the 10 second long period you just
> need to update the calculation of ci_cc_microset_count, and adjust the
> "one tenth of a second" and "hundred times" comments.

It could say "about 1/10 second" or "about 10 seconds", or it could say
"short interval" and "long interval", and let the actual numbers be varied
(by experiment) for each port.
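
For instance, the short interval could be derived from hz at setup
time. A sketch (the helper and its rounding are mine, not the patch
as posted):

	/*
	 * Illustrative: the short interval in hardclock ticks, derived
	 * from hz instead of assuming hz is divisible by 10.  The long
	 * interval stays at 100 short intervals, i.e. about 10 seconds.
	 */
	static int
	cc_short_ticks(int hz)
	{
		int ticks;

		ticks = (hz + 5) / 10;	/* hz=1024 -> 102 (~0.0996s) */
		if (ticks == 0)		/* paranoia for very small hz */
			ticks = 1;
		return ticks;		/* hz=25 -> 3 (0.12s) */
	}

ci_cc_microset_count would then be reloaded from that, and the sanity
check would compare delta against the actual interval (ticks * tick
microseconds) rather than a hard-coded 100000.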
 
>> The burden of calling cc_microset() more frequently is offset, to some
>> extent, by the fact that it does less on all but 1 in 100 calls. Is
>> that acceptable? Note that the slowest ports don't even have a cycle
>> counter, and so will never call cc_microset().
> 
> I'm guessing a profiled kernel wouldn't show any useful info since we're
> analysing something at clock interrupt time.  It should be easy to rig
> up some rudimentary timing info to see how many clock cycles the
> 1-in-100 and 99-in-100 calls actually take.
> 
> For the SMP case, your new code results in ten times the number of microset
> IPIs.  I really have no idea what sort of performance impact that has.

Right. I don't know either. That's definitely the biggest possible
downside. For what it's worth, it doesn't *feel* any more sluggish.
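
As for rigging up timing info, a quick hack along these lines ought to
do (cpu_counter32() is what cc_microset() already reads; the "_hack"
names are made up):

	#include <sys/types.h>
	#include <machine/cpu_counter.h>

	struct cpu_info;
	void cc_microset(struct cpu_info *);

	/* Worst-case cost in cycles. */
	static uint32_t cc_microset_cycles_hack;

	static void
	cc_microset_timed_hack(struct cpu_info *ci)
	{
		uint32_t before, spent;

		before = cpu_counter32();
		cc_microset(ci);
		spent = cpu_counter32() - before; /* unsigned wrap is harmless */
		if (spent > cc_microset_cycles_hack)
			cc_microset_cycles_hack = spent;
	}

Two such accumulators, selected on the 1-in-100 flag, would separate
the cheap calls from the calibration calls.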
 
> Should ci_cc_microset_cc_lo/hi be uint32_t instead of int64_t?

You are correct. The overflow is never stored in ci_cc_microset_cc_lo,
as it is in ci_cc_cc, so it doesn't need to be more than 32 bits.
 
> I'm not understanding the relationship between ci_cc_microset_cc_lo/hi
> and ci_cc_cc either.  We only set ci_cc_microset_cc_lo once every ten
> seconds, and possibly increment ci_cc_microset_cc_hi every now and then
> (six or seven times over 10 seconds on a 3GHz box?), and use them in the
> following calculation:
> 
>                 denom = (ci->ci_cc_microset_cc_hi * 0x100000000LL) +
>                     (cc - ci->ci_cc_microset_cc_lo);
> 
> I'm missing something there...

cc_microtime() only uses ci_cc_time and ci_cc_cc, which are set on the
shorter interval (and the ratio, of course, computed on the longer
interval). What's more, it can safely assume that if cc is less than
ci_cc_cc, the counter has rolled over.

In order for cc_microset() to calibrate the ratio over a longer
interval, it needs to know whether the 32-bit counter has rolled over,
and how many times, and the only way to do that is to track it on the
short interval. (OK, so it's not the *only* way: you could try to
estimate.) So at the end of the long interval, ci_cc_microset_cc_hi
contains the precise number of times the counter has rolled over (i.e.
the high bits), and ci_cc_microset_cc_lo contains the counter value at
the beginning of the long interval (so the difference supplies the low
bits).
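
In code, the two halves come out something like this (an illustrative
subset of the real struct, using the uint32_t types from your other
question; the reset at the end of the long interval is my guess at the
surrounding logic):

	#include <stdint.h>

	/* Illustrative subset of the per-CPU state. */
	struct cpu_info {
		uint32_t ci_cc_cc;		/* counter at last short interval */
		uint32_t ci_cc_microset_cc_lo;	/* counter at start of long interval */
		uint32_t ci_cc_microset_cc_hi;	/* wraps seen this long interval */
	};

	/* Every short interval: count rollovers. */
	static void
	cc_track_wrap(struct cpu_info *ci, uint32_t cc)
	{
		if (cc < ci->ci_cc_cc)		/* counter rolled over */
			ci->ci_cc_microset_cc_hi++;
		ci->ci_cc_cc = cc;
	}

	/* Every long interval: rebuild the 64-bit cycle count. */
	static int64_t
	cc_long_denom(struct cpu_info *ci, uint32_t cc)
	{
		int64_t denom;

		/*
		 * The subtraction must happen in 64 bits: _hi already
		 * counts every wrap, so a modular 32-bit subtract would
		 * double-count one rollover whenever cc < _lo.
		 */
		denom = ((int64_t)ci->ci_cc_microset_cc_hi << 32) +
		    (int64_t)cc - (int64_t)ci->ci_cc_microset_cc_lo;
		ci->ci_cc_microset_cc_hi = 0;
		ci->ci_cc_microset_cc_lo = cc;
		return denom;
	}

On your 3GHz example, _hi comes out 6 or 7 after ten seconds
(3e10 / 2^32), which matches your count.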

> I also wonder if the two expressions in:
> 
>         if (delta < (100000 - (1000000 / hz) / 2) || 
>              delta > (100000 + (1000000 / hz) / 2)) {
> 
> might be better off being pre-calculated or not?  Divides can take a
> long time on some architectures.  On the other hand, divides might
> be quicker than a memory load on others (though we need to load "hz"
> anyway)...

I suppose we could just use the tick global for that, so it could be:

	if (delta < (100000 - (tick >> 1)) ||
	    delta > (100000 + (tick >> 1))) {
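
(tick is 1000000 / hz, so the shift gives exactly the half-tick slop
with no divide at interrupt time.) Or, since the bounds never change
after boot, they could be precomputed outright; a sketch with made-up
variable names:

	#include <sys/types.h>
	#include <sys/kernel.h>		/* extern int tick */

	static int64_t cc_delta_min, cc_delta_max;

	/* Called once, after hz and tick are set. */
	static void
	cc_bounds_init(void)
	{
		/* 100000us = the 1/10s short interval; would scale
		   with the actual interval on odd-hz ports. */
		cc_delta_min = 100000 - (tick >> 1);
		cc_delta_max = 100000 + (tick >> 1);
	}

leaving just two loads and two compares in the interrupt path:
if (delta < cc_delta_min || delta > cc_delta_max).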
 
> It just occurs to me that as we pass 4GHz boxes the current "once per
> second" thing isn't going to be good enough, is it?  Certainly from just
> this alone, we need to make some changes soon, and from what I understand
> your changes should handle this.  That said, when we hit 40GHz boxes
> (only another factor of ten, and it wasn't that long ago that 400MHz
> boxes were fast) we will need to set ci_cc_microset_count so that
> we're called even more often than every 1/10th of a second...

By then, we'll probably want to use the full 64-bit cycle counter on i386,
and let the code diverge from the ports that don't have one. With some
minor modifications, we could even fake the 64-bit counter here for them.
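
For the record, the arithmetic behind the 4GHz concern (plain
arithmetic, nothing from the patch):

	#include <stdio.h>

	/* How long a 32-bit cycle counter lasts at various clock rates. */
	int
	main(void)
	{
		double rates[] = { 4e8, 4e9, 4e10 };	/* 400MHz, 4GHz, 40GHz */
		int i;

		for (i = 0; i < 3; i++)
			printf("%6.0f MHz: wraps every %.3f s\n",
			    rates[i] / 1e6, 4294967296.0 / rates[i]);
		return 0;
	}

That prints 10.737 s, 1.074 s and 0.107 s: once a second is already
marginal at 4GHz, and at 40GHz even 1/10 second only just catches
every wrap.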
 
>> Another thing -- after building with CC_VERBOSE, "gdb" can't actually
>> set cc_verbose to "0" to turn it off. ("gdb" dumps core.) Can anyone
>> tell me what I'm doing wrong?
> 
> No idea off hand, but you shouldn't need a "volatile" in the declaration
> of cc_verbose.  Does it help removing that?

I'll try that. Thanks for the suggestion.
 
> I'd also change the name of that cpp flag to CC_DEBUG or DEBUG_CC
> (something with "DEBUG" in it) to keep in line with other debug flag
> names.

Sure. (I was originally going to only print the value when it changed,
maybe a few times a day. On the P4 SMP box, though, it always changes,
so I gave up on that idea.)


Frederick