I would like to rework the clock support in the kernel a bit to correct some deficiencies which exist now, and to provide new functionality. The issues I would like to try to address include:

- It has become common for systems to include clocks which are unsuitable for use as the time source for the system clock but which are nonetheless useful because they are the timestamp source for (hardware) measurements of external events. The most frequently encountered example of this may be the counter included in many Ethernet MAC chips which is sampled when IEEE 1588 packets are sent and received; many systems may have more than one of these. Peripherals which hardware-timestamp other types of events (e.g. signal transitions, like the PPS output of a GPS receiver) are often found in integrated SoCs, as are devices which use a free-running counter to generate events. Making all of these "measurement" clocks useful to the system seems to require two things. It first requires that each of these clocks be visible to, and independently adjustable via, a clock adjustment interface. The only thing one can do with accurate-in-time external events which are measured with a particular clock is to use the information to adjust that particular clock into synchronization, so each such clock must be independently adjustable. The second requirement is that it must be possible to measure the times of independent pairs of clocks in the system against each other as precisely as possible, perhaps with a sequence of the form

        read clock A
        read clock B
        read clock A

  to provide an estimate of both the offset between the clocks and the uncertainty/ambiguity of the measurement itself. The reason for this is that having a precisely synchronized measurement clock in, say, an Ethernet MAC chip is clearly fairly useless by itself. Its time becomes useful only when it can be transferred to the system clock and/or other clocks in need of synchronization so that other applications can use it too.

- Acquiring a timestamp from a clock is generally done by (1) reading a value from a hardware register, then (2) manipulating the value and doing the arithmetic necessary to convert it to a time of day. I would like to be able to separate (1) from (2), storing the raw output from the hardware now (I've been calling this a "tickstamp") but deferring the work of converting it to a meaningful timestamp until a bit later. An example use of this might be to tickstamp every packet arrival early in a network controller's interrupt routine but only go to the trouble of converting the tickstamp to a timestamp if the packet arrives somewhere which cares about this (e.g. an NTP or PTP socket), with unused tickstamps perhaps being donated to the random number pool. Event timestamping has many uses, and with a suitably inexpensive hardware time source capturing event times in tickstamp form has many advantages. It minimizes the overhead in the case that not all tickstamps are consumed as timestamps, it often allows the bulk of the work of acquiring a timestamp to be moved from time-critical code to code with more relaxed constraints, and it is probably the most appropriate way to provide random number pool entropy. The clock-pair polling above might be implemented as

        read clock A tickstamp
        read clock B tickstamp
        read clock A tickstamp

  with the corresponding timestamps being sorted out after the time-critical polling code segment has been completed.
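  To make the A/B/A idea concrete, here is a minimal sketch of how the offset and ambiguity might be computed from one such poll. The function and structure names are hypothetical, not part of the proposal, and the time format is the 32.32 fixed-point systime_t described later in this message:

    #include <stdint.h>

    typedef uint64_t systime_t;     /* 32.32 seconds since some epoch */

    struct clockpair_sample {
        systime_t offset;           /* estimate of clock B - clock A */
        systime_t ambiguity;        /* width of the A..A read window */
    };

    /*
     * Poll clock B against clock A.  The arithmetic is modular, so a
     * negative offset simply wraps; consumers treat it as signed.
     * Assumes clock A is monotonic, so a2 >= a1.
     */
    static void
    clockpair_poll(systime_t (*read_a)(void), systime_t (*read_b)(void),
        struct clockpair_sample *sp)
    {
        systime_t a1, b, a2;

        a1 = (*read_a)();
        b = (*read_b)();
        a2 = (*read_a)();

        /* Assume B was sampled midway between the two reads of A. */
        sp->offset = b - (a1 + (a2 - a1) / 2);
        sp->ambiguity = a2 - a1;
    }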
  It might also be possible to provide arithmetic functions computing nominal time intervals directly from tickstamps themselves, to make the implementation of things like CoDel's packet timestamping more economical.

- The fundamental problem that clock synchronization software needs to deal with is that the oscillator (likely a crystal soldered to a board somewhere) driving the clock the software is trying to keep correct makes errors: the output frequency of the oscillator usually differs considerably from the number written on its package and will measurably vary with time. The operating system will likely be setting the rate of advance of the digital clock being driven by the oscillator based on the number written on the oscillator's package; the job of clock synchronization software is to measure the actual frequency of the oscillator and to update the clock's rate of advance to match. The traditional BSD clock adjustment interface, adjtime(2), provides only a slewing phase (i.e. time) adjustment. There is no way to directly alter the clock's rate of advance to reduce the need for phase adjustments, nor is the adjtime(2) slew rate adjustable, and by modern standards the precision of the phase adjustments it implements is quite limited. The optional NTP adjustment interface does provide a frequency adjustment, but its resolution is lower than the measurement precision can be, and the adjustment is embedded in other code imposing implementation constraints which aren't always appropriate. The plan here would retain the older interfaces but add a new system call interface providing independent phase and frequency adjustments for each clock made visible at the interface, and allowing polled sampling of the relationship between pairs of those clocks. In addition, certain technical advantages are gained if the effect of clock adjustments is transparently predictable (a definition of this might be that when an adjustment is made it should be possible to precisely compute, from post-adjustment timestamps, the values of timestamps that would have been observed if the adjustment had not been made), so some effort will be given to ensuring that the arithmetic implementing an adjustment has a predictable effect which is accurately reported to the caller.

- I believe that a side effect of the above work will be to make it possible to provide system-call-less time of day library functions when the hardware underlying the system clock is accessible from user space (e.g. the CPU cycle counter).

- The current clock code somehow results in a clock which jitters by some 100's of nanoseconds even when the underlying hardware is much more precise than this and when no adjustments are being made. This becomes noticeable when hardware tickstamp peripherals are available to measure it. I haven't quite figured out where this comes from, but I think it has something to do with the manipulation of the conversion constants done at clock interrupt time. The changes to the clock adjustment scheme would make tick-to-time conversions "tickless", (mostly) eliminating the need for clock interrupt maintenance of conversion data.

I have a very long, and now somewhat out of date, writeup on how I think a clock adjustment interface should work, and why I think that. With luck it should be available here

        http://www.mistimed.com/home/Clock.pdf

but I can provide a summary of changes to save you the trouble of looking at that. Perhaps the least good thing is that it uses Yet Another Time Format internally, and at the adjustment interface.
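Purely for concreteness, here is a guess at the shape such a per-clock adjustment request might take, expressed with the systime_t and sysrate_t types described next; every name in this sketch is hypothetical rather than the proposed interface:

    #include <stdint.h>

    typedef uint64_t systime_t;     /* 32.32 seconds since some epoch */
    typedef int64_t sysrate_t;      /* signed fraction in [-0.5, 0.5) */

    /*
     * Hypothetical per-clock adjustment request, illustrating the
     * independent phase and frequency knobs plus a scheduled
     * effective time.  Not the proposal's actual ABI.
     */
    struct clk_adjustment {
        systime_t when;     /* future time at which this takes effect */
        systime_t step;     /* phase (offset) step, modular */
        sysrate_t rate;     /* change to the clock's rate of advance */
    };

A call of the general form clk_adjust(clockid, &adj, &result) could then report back the exact arithmetic effect of the adjustment, keeping adjustments transparently predictable in the sense described above.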
A systime_t is an unsigned 64 bit integer type holding a seconds-since-some-epoch value, with the high order 32 bits being an integral seconds count and the low order 32 bits being a fractional second. The decision to use this was arrived at only after trying to make use of existing time formats and discovering how much easier it was to write correct code if the time format behaves more or less like a regular integer type, without requiring simple operations like additions, subtractions and comparisons to be encapsulated in function calls.

The other important type which appears at the adjustment interface is a sysrate_t, a signed 64 bit integer type representing a fractional value in the range [-0.5, 0.5). This is used to pass changes to the rate of advance of a clock and to specify a slew rate and direction for slewing time offsets. The low order bit of the representation has a precision of about 5 * 10^-20, which is ample for all practical, and even impractical, purposes.

Internal to the kernel a timestamp passes through three states on the way to a time of day. A tickstamp_t is a 64 bit integer type filled in with as raw a sample from the hardware and its associated software state (if needed) as can be unambiguously converted to a timestamp in the not too distant future. At conversion time clock-dependent code is called to convert the tickstamp_t value to a tickcount_t, an unsigned 64 bit integer count of "ticks", ideally initialized to zero when the system is booted and advancing at the rate of the underlying counter. How this conversion is accomplished depends on the characteristics of the clock and what is convenient for the processor. If the hardware is a 64 bit counter and it is convenient to sample all 64 bits in the tickstamp_t then the conversion may do nothing; if the counter is less than 64 bits then the clock-dependent code may be called upon to supply the high order bits of the tickcount_t representation; while if the hardware is other than a free-running up-counter (say, a count-down interrupt timer) then the conversion may require some arithmetic.

The conversion from a tickcount_t `tc' to a systime_t `t' is done in clock-independent code with the following computation (all variables but maybe `s' are unsigned 64 bit integer values):

        t = (tc << s) * r + c

Here `s' is a constant initialized from the nominal rate of the clock, chosen so that bit 32 of the shifted counter value (tc << s) will change state at least once each second. The multiply by `r' returns the high order 64 bits of the 128 bit product; with a minimal value of `s' at least 63 bits of `r' will be significant. The values of `r' and `c' are manipulated to implement adjustments. A step offset adjustment would be implemented by modifying the value of `c', while a rate of advance adjustment would be made by modifying `r' with a corresponding change to `c' to maintain phase continuity. One additional feature, the ability to schedule an adjustment to occur at a precise future time, is used to implement slewed offset adjustments a la adjtime(2) (using a pair of rate of advance adjustments, one done `now' and another restoring the original `r' scheduled at the moment the requested offset will have accumulated) and for leap second support (a step or slewed offset adjustment scheduled at an appropriate time for the leap).
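A minimal sketch of that conversion, with names of my own choosing, using the GCC/Clang 128 bit type for the wide multiply on 64 bit machines (a portable construction is shown further below):

    #include <stdint.h>

    typedef uint64_t tickcount_t;
    typedef uint64_t systime_t;

    /* Conversion constants; change only when an adjustment is made. */
    struct scaling {
        uint64_t r;     /* rate multiplier, about 63 significant bits */
        uint64_t c;     /* phase constant */
    };

    /* High 64 bits of the unsigned 128 bit product.  Compilers on
     * 64 bit machines reduce this to one or two instructions. */
    static inline uint64_t
    umul64hi(uint64_t a, uint64_t b)
    {
        return (uint64_t)(((unsigned __int128)a * b) >> 64);
    }

    /* t = (tc << s) * r + c, all arithmetic modulo 2^64. */
    static inline systime_t
    tc_to_systime(tickcount_t tc, unsigned int s, const struct scaling *sc)
    {
        return umul64hi(tc << s, sc->r) + sc->c;
    }

    /*
     * Rate adjustment with phase continuity: install a new `r' and
     * recompute `c' so the conversion still passes through the
     * current (tc, t) point.
     */
    static inline void
    scaling_set_rate(struct scaling *sc, uint64_t new_r, tickcount_t now,
        unsigned int s)
    {
        systime_t t_now = tc_to_systime(now, s, sc);

        sc->r = new_r;
        sc->c = t_now - umul64hi(now << s, new_r);
    }

A step offset adjustment would just add the (modular) offset to sc->c.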
To enable the scheduling of future changes, and to retain history so that tickcount_t values which are converted late end up producing the same systime_t value that they would have had they been converted immediately, conversion constants are stored in a circular array of (tc, r, c) tuples, with the value of `tc' indicating the tickcount_t value at which the constants became valid for use. The conversion of a `tc' hence searches for constants starting at the most recent array entry, returning the constants from the first entry found with a `tc' no later than the target (a sketch of this lookup appears at the end of this description).

Note that, with the exception of dealing with a far corner case (the overflow of (tc << s)), no periodic maintenance of `r' and `c' is required. The implementation is "tickless": values of `r' and `c' only change when an adjustment operation explicitly changes them. Between adjustments the values are constants, and the precision of those constants, along with the direct linear relationship between `tc' and `t', should guarantee that the jitter of the systime_t result of the computation is identically that of the tickcount_t value it is computed from, for hardware counter frequencies less than 4 GHz. It is the case that the conversion from tickstamp_t to tickcount_t will likely require periodic maintenance if the tickstamp_t samples less than a full 64 bits of counter, but in the normal case that work will be to determine the high order bits of the tickcount_t representation, with the low bits still coming directly from the hardware. Errors made when implementing this should hence have large and noticeable effects rather than small and subtle ones.

This arrangement is not perfectly compatible with the current timecounter code, which essentially views the function of a hardware counter to be to measure time intervals between hardclock interrupts (the implication of the "Timecounters tick every 10.000 msec" message), but is not inconsistent with it. It should be possible to provide a cookbook recipe for converting a timecounter to the new, tickless arrangement with minimal thinking, particularly when the underlying hardware is a conventional free-running up-counter. The lack of a clock interrupt synchronization point common to all clocks will make changing the system clock source a bit more awkward, but the visibility of all clocks at the adjustment interface means that the code to do a synchronize-and-switch of the system clock need not reside in the kernel which, given that this operation is rarely required, seems appropriate.

While the scheme here provides the necessary visibility of IEEE 1588 counter clocks associated with Ethernet interfaces for the purposes of adjustment and time transfer, there is additional work, not yet done, needed to provide a framework for handling PTP tickstamps and software tickstamping (for NTP, and for PTP when hardware support is absent) in Ethernet drivers. The ultimate aim of the work would be a time synchronization application implementing both NTP and PTP, managing the multiple clocks and taking advantage of the precision of the adjustment interface to improve the quality of system timekeeping while minimizing the cost of doing so.

The first step in the implementation of this would be to provide the arithmetic functions needed to maintain the adjustment constants, making them available to both the kernel and to user space applications.
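Returning to the conversion-constant history described above, the tuple lookup might look something like this sketch (again with names of my own invention; synchronization with writers is elided):

    #include <stdint.h>

    typedef uint64_t tickcount_t;

    #define NSCALING    8       /* entries; must be a power of two */

    struct scaling_entry {
        tickcount_t tc;         /* constants valid from this tick on */
        uint64_t    r;
        uint64_t    c;
    };

    struct scaling_hist {
        unsigned int         latest;        /* index of newest entry */
        struct scaling_entry ent[NSCALING];
    };

    /*
     * Find the constants in effect at `tc': scan backwards from the
     * newest entry for the first whose tc is no later than the
     * target.  A future-dated entry (a scheduled adjustment or a
     * leap) is skipped naturally until its moment arrives.
     */
    static const struct scaling_entry *
    scaling_lookup(const struct scaling_hist *h, tickcount_t tc)
    {
        unsigned int i, n;

        for (i = h->latest, n = 0; n < NSCALING; i--, n++) {
            const struct scaling_entry *e = &h->ent[i & (NSCALING - 1)];

            if (e->tc <= tc)
                return e;
        }
        /* Older than all retained history; best effort. */
        return &h->ent[(h->latest + 1) & (NSCALING - 1)];
    }

Nothing in this lookup needs the clock interrupt; entries are written only when adjustments are made.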
In addition to the frequently used unsigned 64x64=(high order 64 bits) multiply, a 64x64=128 bit multiply and a 128/64=64%64 bit divide are needed, though much less frequently used. The library, described in the attached man page, also exposes the component operations from which the required 128 bit integer functions may be constructed, in case these are useful. A standard C implementation of the entire library is provided, but the header files are arranged to override these with machine-specific inline versions for any of the functions which can be implemented by the processor with a suitably short code/instruction sequence (all of them are inline for x86_64). All the 64 bit processors I've looked at have instructions to do the 128 bit multiplies, save sparc64. For the 32 bit machines I've had to run code on I've written assembly replacements of the 128 bit multiply functions, since having the use of the processor carry flag provides an advantage over the C implementation (a portable construction of the multiply is sketched below). The 128 bit divide is less important in terms of performance, and the C code would be difficult to improve upon in any case, though I did an assembly 64/32=32%32 bit divide function for arm, which generally lacks hardware divide instructions altogether.

I have a prototype implementation of the clock adjustment interface and have found it capable of producing rather excellent results at a relatively modest cost, given a quality time source. I have a custom clock implemented on a PCI-X card which can synchronize itself to the PPS and frequency outputs of a GPSDO receiver with a precision of about 3.1 ns. If I poll the TSC-based system clock against the card clock 4 times per second I find I can keep the system clock synchronized to the card clock with a frequency adjustment of roughly 10^-9 and a phase adjustment of roughly 10 ns made about once every 10 seconds (i.e. about one adjustment per 40 samples). If the card polling ambiguity (+/- 7 ns) is included, that's a system clock generally kept within 20 ns of GPS with an adjustment rate of once every 10 seconds.

Clearly an NTP-quality time source at NTP sample rates will produce results which are much worse than this, but maintaining a much sloppier clock has the advantage of requiring a much lower rate of adjustment (I think maybe once every 500 seconds for NTP if the machine's temperature is fairly constant). This contrasts with the current ntpd, which keeps the clock accurate with an adjustment call to adjtime() every second when the kernel code is excluded, or an adjustment every hardclock() interrupt when the kernel code is used; the low adjustment rate also provides the (cache behaviour) economy of shared data which is very frequently read but very seldom written.

Another advantage is that it replaces the current clock maintenance implementation, which essentially operates by measuring time intervals between hardclock() interrupts and hence is inextricably attached to hardclock(), with one which detaches clock-keeping functions from hardclock() altogether. If the timeout queue could be removed from hardclock() in favour of a tickless implementation as well then maybe there wouldn't be much left for hardclock() to do. Finally, it provides a structural framework for dealing with IEEE 1588 and other interesting tickstamping peripherals that is missing now.
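For illustration, the standard C construction of the 64x64=128 bit multiply from 32 bit halves, mentioned above, might look like the following. This is my sketch of the usual technique, not the library's actual code; the attached man page describes the real interface:

    #include <stdint.h>

    /*
     * Unsigned 64x64=128 bit multiply built from 32 bit half-words.
     * Returns the low 64 bits of the product and stores the high
     * 64 bits through `hip'.  None of the intermediate additions
     * can overflow, which is why carry-flag assembly does better.
     */
    static uint64_t
    umul64x64(uint64_t a, uint64_t b, uint64_t *hip)
    {
        uint64_t alo = a & 0xffffffffu, ahi = a >> 32;
        uint64_t blo = b & 0xffffffffu, bhi = b >> 32;
        uint64_t lo, t, tl;

        lo = alo * blo;
        t = ahi * blo + (lo >> 32);         /* cannot overflow */
        tl = alo * bhi + (t & 0xffffffffu); /* cannot overflow */

        *hip = ahi * bhi + (t >> 32) + (tl >> 32);
        return (tl << 32) | (lo & 0xffffffffu);
    }

The high-half multiply used by the tick-to-time conversion is just this with the low 64 bits discarded.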
I'm not quite sure of an implementation strategy (installing the math library is painless but the next step might not be), but I'm leaning towards trying to do it as a kernel compilation option, retaining the existing code, until such time as most architectures have had support work done and it has been shown that the new code is not worse than the old code. At that point there could be a switch over to the new code, followed by work on things like packet tickstamping which depend on having the newer infrastructure.

I'm not 100% confident that this will work quite as well as I think it should in all cases, but where I've managed to try it the results have been pretty good. I do think that finishing the work will minimally allow the combination of NetBSD + a BeagleBone Black (or a processor with similar peripherals) + a GPS timing receiver board to provide the most accurate NTP+PTP server that $100 can buy, and I'd like to have some of those. I realize reasonable people might differ on some of this, however.

Dennis Ferguson
Attachment: ulmath.3