Re: Proposal for kernel clock changes

To: David Laight <david%l8s.co.uk@localhost>
Subject: Re: Proposal for kernel clock changes
From: Dennis Ferguson <dennis.c.ferguson%gmail.com@localhost>
Date: Wed, 2 Apr 2014 09:05:18 -0700
On 1 Apr, 2014, at 12:50 , David Laight <david%l8s.co.uk@localhost> wrote:
> On Fri, Mar 28, 2014 at 06:16:23PM -0400, Dennis Ferguson wrote:
>> I would like to rework the clock support in the kernel a bit to correct
>> some deficiencies which exist now, and to provide new functionality.  The
>> issues I would like to try to address include:
> 
> A few comments, I've deleted the body so they aren't hidden!

Thanks very much for looking at it.  I know that reading about
clocks is, for most people, a good way to put oneself to sleep
at night.

> One problem I do see is knowing which counter to trust most.
> You are trying to cross synchronise values and it might be that
> the clock with the best long term accuracy is a very slow one
> with a lot of jitter (NTP over dialup anyone?).
> Whereas the fastest clock is likely to have the least jitter, but
> may not have the long term stability.

This is true but when considering the quality of non-special-purpose
computer clock hardware running on its own, either on the CPU board
or on an ethernet card, what you'll effectively end up trying to
determine by this is whether the clock is just crappy, or is crappier
than that.  The stability of cheap, uncompensated free-running
crystals is always poor; you shouldn't trust any of these unless
you have no choice, and life is too short to worry about trying to
measure degrees of crappiness.

Since all the clocks in your system are likely to be crappy if left
running free, the "best" clock in the system will always be the one
which is making the most accurate measurements of the most accurate
external time source you have available and steering itself to that.
The only important "quality" of a clock is how well it is measuring
its time source and how good that time source is.  The measurement
clocks are only useful if you have an application which is interested
in taking and processing those measurements, and if that application
is not broken it will certainly come to some opinion about which of
those clocks is the best one based on those measurements.  That will be
the clock the time comes from; the polling is the mechanism to get it
to the others.  The kernel itself will see the polling and see adjustments
being made to clocks but it will be the application which knows why that
is being done and which way the time is moving.  If there are no
external time sources, however, you'll probably just live with whatever
your chosen system clock does and not worry about the measurement clocks.

> There are places where you are only interested in the difference
> between timestamps - rather than needing them converting to absolute
> times.

I'm not quite sure how to read that, but I'll guess.  I over-simplified
the description of what is being maintained a bit.  I'm fond of, and
the system call interface I like makes use of, the two timescales the
kernel maintains now, i.e.

    time = uptime + boottime;

where `time' has a UTC-aligned epoch, `uptime' has an epoch around the
time the machine was booted, and boottime is a mostly-constant value
which expresses uptime's epoch in terms of time's epoch.  uptime is
maintained to advance at the same rate as time but to be phase
continuous, which means that uptime will advance at as close to the
rate of the SI second as we can determine it (since it advances
at the same rate as time, which advances at the rate of UTC, which
advances at the rate of the SI second) but is unaffected by step
changes made to time make to bring it into phase alignment to UTC
(boottime changes instead).  uptime hence tracks UTC's frequency but
not its phase.
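
As a rough illustration of the relationship above, here is a minimal
sketch of the two timescales.  The struct, the function names and the
use of plain nanosecond counters are all assumptions made for the
example; they are not the proposal's actual types.  The point is only
that a step correction lands in boottime and leaves uptime alone.

```c
#include <stdint.h>

/* Hypothetical model of the two timescales.  Nanosecond units are an
 * assumption for illustration, not the proposal's systime_t format. */
struct timescales {
    int64_t uptime_ns;    /* phase-continuous, never stepped */
    int64_t boottime_ns;  /* uptime's epoch expressed in time's epoch */
};

/* time = uptime + boottime */
static int64_t ts_time(const struct timescales *ts)
{
    return ts->uptime_ns + ts->boottime_ns;
}

/* Normal advance: time and uptime move at the same rate. */
static void ts_advance(struct timescales *ts, int64_t elapsed_ns)
{
    ts->uptime_ns += elapsed_ns;
}

/* A step that realigns time with UTC changes boottime only. */
static void ts_step(struct timescales *ts, int64_t step_ns)
{
    ts->boottime_ns += step_ns;
}
```

After a call to ts_step() the value returned by ts_time() jumps by the
step, while uptime_ns is exactly what it would have been without it.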

If you want to measure the interval between timestamps, then, I think
you would take your timestamps in terms of uptime and then compute

    interval = uptime[1] - uptime[0];

which should reliably give you the system's best estimate of the elapsed
number of SI seconds between the times the two stamps were acquired.
I like to record event timestamps in terms of uptime as well since it
makes it unambiguous when the events occurred even if someone calls
clock_settime() in between.  Also, the tuple describing a conversion
from a tickcount_t tc to a systime_t, which I over-simplified, actually
maintains the pair of timescales by maintaining two `c' values, so that

    time = (tc << s) * r + c_time;
    uptime = (tc << s) * r + c_uptime;

and

    boottime = c_time - c_uptime;
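
A sketch of what such a tuple might look like in C follows.  The
concrete formats are assumptions chosen to keep the arithmetic simple:
systime_t is taken to be 64.64 fixed-point seconds held in the
GCC/Clang unsigned __int128 extension, and `r' is taken to be a 64 bit
fraction of a second per shifted tick.  The struct and function names
are hypothetical.

```c
#include <stdint.h>

/* All types and formats here are assumptions for illustration only. */
typedef uint64_t tickcount_t;          /* raw counter reading */
typedef unsigned __int128 systime_t;   /* 64.64 fixed-point seconds */

struct clock_conv {
    unsigned  s;         /* left shift applied to the tick count */
    uint64_t  r;         /* seconds per shifted tick, units of 2^-64 s */
    systime_t c_time;    /* offset producing UTC-aligned time */
    systime_t c_uptime;  /* offset producing phase-continuous uptime */
};

/* time = (tc << s) * r + c_time */
static systime_t conv_time(const struct clock_conv *cv, tickcount_t tc)
{
    return (systime_t)(tc << cv->s) * cv->r + cv->c_time;
}

/* uptime = (tc << s) * r + c_uptime */
static systime_t conv_uptime(const struct clock_conv *cv, tickcount_t tc)
{
    return (systime_t)(tc << cv->s) * cv->r + cv->c_uptime;
}

/* boottime = c_time - c_uptime */
static systime_t conv_boottime(const struct clock_conv *cv)
{
    return cv->c_time - cv->c_uptime;
}
```

Because both conversions share the same (tc << s) * r term, the
difference between them is boottime for every tick count, which is the
invariant the three equations above express.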

So if "absolute time" means UTC, in the form of UTC-aligned `time',
then I agree.  You can't reliably compute time intervals from two
UTC timestamps since, almost unavoidably, some day the system's
estimate of UTC will be wrong and will require a step change to
fix, and you'll compute a bogus time interval if your timestamps
straddle that.  On the other hand, if avoiding "needing them
converted to absolute times" means hanging on to the raw
tickstamp/tickcount for an extended period then I don't see
the point.  The conversion isn't very expensive, and a pair of
uptime timestamps taken from the system clock will reliably
allow you to compute intervals in SI seconds (or the system's
best estimate of SI seconds), which is probably what you'd like
to know.
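
For comparison, the same pattern can be written against the existing
POSIX interface.  This is only a sketch: CLOCK_MONOTONIC is used here
as the nearest standard analogue of uptime (it cannot be set with
clock_settime()), not as the interface being proposed, and how closely
it tracks the SI second depends on the system.  The helper names are
made up for the example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime() in strict modes */
#endif
#include <stdint.h>
#include <time.h>

/* Take an uptime-like timestamp.  Returns 0 on success, -1 on error. */
static int uptime_stamp(struct timespec *ts)
{
    return clock_gettime(CLOCK_MONOTONIC, ts);
}

/* interval = uptime[1] - uptime[0], in nanoseconds. */
static int64_t interval_ns(const struct timespec *t0,
                           const struct timespec *t1)
{
    return (int64_t)(t1->tv_sec - t0->tv_sec) * 1000000000LL
         + (t1->tv_nsec - t0->tv_nsec);
}
```

Two stamps taken with uptime_stamp() give a non-negative interval_ns()
even if the wall-clock time is stepped in between.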

> This may mean that you can (effectively) count the ticks on all your
> clocks since 'boot' and then scale the frequency of each to give the
> same 'time since boot' - even though that will slightly change the
> relationship between old timestamps taken on different clocks.
> Possibly you do need a small offset for each clock to avoid
> discrepancies in the 'current time' when you recalculate the clock's
> frequency.

The rate of advance which clock synchronization software sets
the clock to is actually a prediction of clock performance in the
immediate future based on measurements of as little of the clock's
recent past behaviour as the software thinks might be useful.  The
problem with crappy clocks is that the number changes a lot, and
while longer averages help if you have a constant signal in zero-mean
noise they make things worse if the signal itself is a moving target,
which is what the clock's frequency is like.  A current frequency
hence won't tell you very much about the very distant past (or
distant future).

Note, however, that the frequency changes are being made explicitly
to make the ratio of "uptime seconds"/"SI seconds" as close to unity
as possible at all times, and the longer the interval the closer to
unity it will generally be.  If you keep uptime timestamps you can
compute intervals in SI seconds with considerable precision; uptime
itself is our best estimate of the number of SI seconds since boot.
Also, I would expect most applications to exclusively take their
timestamps from the system clock (the point of making measurements
is to make the system clock as accurate as possible) and, while the
hardware source of the system clock might (rarely) be changed, it
will make this change in a way which keeps uptime as continuous as
it can even if the raw tickstamps look very different.

> If the 128bit divides are being done to generate corrected frequencies,
> it might be that you can use the error term to adjust the current value
> - and remove the need for the divide at all (after the initial setup).

The user interface expresses rate changes as a sysrate_t, which lets
the new value of the `r' rate constant be computed with a 128 bit
multiply.  The code I currently have uses the divide in the kernel in
3 spots:

- It needs the 128 bit divide once to compute the nominal value of `r' from
  each clock's counter frequency, which is done when a clock is initialized.

- It needs it to compute the tickcount_t time of a change that needs
  to be scheduled for a future moment, like the end of a slew a la
  adjtime() or a leap second.

- The system call interface doesn't promise to do exactly the adjustment
  you tell it to, but does promise to tell you exactly what it ended up
  doing.  For rate changes it currently does a divide to figure out
  the rate it is actually setting in sysrate_t form to return to the
  caller since the different precision of the 'r' rate constant can
  change what you asked for by a couple of low order bits (i.e. 10^-18).
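
To make the first two uses concrete, here is a sketch under assumed
formats: times are 64.64 fixed-point seconds, `r' is a 64 bit fraction
of a second per shifted tick, and the 128 bit arithmetic comes from the
GCC/Clang unsigned __int128 extension.  The function names and these
formats are hypothetical and are not taken from the actual code.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;  /* GCC/Clang extension */

/* Nominal r = 2^64 / (freq_hz << s), truncated.  Assumed format:
 * seconds per shifted tick as a 64 bit binary fraction. */
static uint64_t nominal_r(uint64_t freq_hz, unsigned s)
{
    u128 shifted_hz = (u128)freq_hz << s;

    assert(shifted_hz > 1);  /* otherwise r would not fit in 64 bits */
    return (uint64_t)(((u128)1 << 64) / shifted_hz);
}

/* Tick count at which the clock reaches `target', found by inverting
 * t = (tc << s) * r + c.  The result is rounded down. */
static uint64_t tick_at(u128 target, u128 c, uint64_t r, unsigned s)
{
    assert(r != 0 && target >= c);
    return (uint64_t)(((target - c) / r) >> s);
}
```

With a 2^30 Hz counter and s = 0 this gives r = 2^34, and a target two
seconds past c maps to tick 2^31, i.e. two seconds' worth of ticks.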

Clearly the last could go away if I get over being anal about precision,
though I'm not sure it has to.  The worst case machine I've looked at
so far is arm, which has no hardware divide instructions at all and
relatively slow multiplies, yet my 1 GHz Cortex A8 can do the 128 bit
divide with shifts, adds and multiplies in about 250 ns, mostly.
That's the same as maybe two or three cache misses.  The only other things
I've measured were on a 2.4 GHz amd64 machine which did the divide in about
22 ns with the 128 bit divide instruction it has, or 28 ns with the C
function that every other machine uses, and the same machine running
i386 code which I remember as being below 40 ns.  And the benefit that
having a fine rate-of-advance adjustment pays for is that it should
allow the clock to be maintained as accurately as it can be with a
minimal rate of adjustment, so ideally the divides won't need to be
done very often.
  
> One thought I've sometimes had is that, instead of trying to synchronise
> the TSC counters in an SMP system, move them as far from each other
> as possible!
> Then, when you read the TSC, you can tell from the value which cpu
> it must have come from!

I need to get a machine with more than one CPU socket at some point.
My current approach has been to bail and use some other clock at the
first sign of trouble...

Dennis Ferguson