I would like to rework the clock support in the kernel a bit to correct some deficiencies which exist now, and to provide new functionality. The issues I would like to try to address include:

- It has become common for systems to include clocks which are unsuitable for use as the time source for the system clock but which are nonetheless useful because they are the timestamp source for (hardware) measurements of external events. The most frequently encountered example of this may be the counter included in many Ethernet MAC chips which is sampled when IEEE 1588 packets are sent and received; many systems may have more than one of these. Peripherals which hardware-timestamp other types of events (e.g. signal transitions, like the PPS output of a GPS receiver) are often found in integrated SoCs, as are devices which use a free-running counter to generate events. Making all of these "measurement" clocks useful to the system seems to require two things. It first requires that each of these clocks be visible to, and independently adjustable via, a clock adjustment interface. The only thing one can do with accurate-in-time external events which are measured with a particular clock is to use the information to adjust that particular clock into synchronization, so each such clock must be independently adjustable. The second requirement is that it must be possible to measure the times of independent pairs of clocks in the system against each other as precisely as possible, perhaps with a sequence of the form

        read clock A
        read clock B
        read clock A

  to provide an estimate of both the offset between the clocks and the uncertainty/ambiguity of the measurement itself. The reason for this is that having a precisely synchronized measurement clock in, say, an Ethernet MAC chip is clearly fairly useless by itself. Its time becomes useful only when it can be transferred to the system clock and/or other clocks in need of synchronization so that other applications can use it too.

- Acquiring a timestamp from a clock is generally done by (1) reading a value from a hardware register, then (2) manipulating the value and doing the arithmetic necessary to convert it to a time of day. I would like to be able to separate (1) from (2), storing the raw output from the hardware now (I've been calling this a "tickstamp") but deferring the work of converting it to a meaningful timestamp until a bit later. An example use of this might be to tickstamp every packet arrival early in a network controller's interrupt routine but only go to the trouble of converting the tickstamp to a timestamp if the packet arrives somewhere which cares about this (e.g. an NTP or PTP socket), with unused tickstamps perhaps being donated to the random number pool. Event timestamping has many uses, and with a suitably inexpensive hardware time source capturing event times in tickstamp form has many advantages. It minimizes the overhead in the case that not all tickstamps are consumed as timestamps, it often allows the bulk of the work of acquiring a timestamp to be moved from time-critical code to code with more relaxed constraints, and it is probably the most appropriate way to provide random number pool entropy. The clock-pair polling above might be implemented as

        read clock A tickstamp
        read clock B tickstamp
        read clock A tickstamp

  with the corresponding timestamps being sorted out after the time-critical polling code segment has been completed.
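  To make the A/B/A idea concrete, here is a minimal sketch of how the offset and ambiguity might be computed from one such poll. The function and structure names are hypothetical, not part of the proposal, and the time format is the 32.32 fixed-point systime_t described later in this message:

    #include <stdint.h>

    typedef uint64_t systime_t;     /* 32.32 seconds since some epoch */

    struct clockpair_sample {
        systime_t offset;           /* estimate of clock B - clock A */
        systime_t ambiguity;        /* width of the A..A read window */
    };

    /*
     * Poll clock B against clock A.  The arithmetic is modular, so a
     * negative offset simply wraps; consumers treat it as signed.
     * Assumes clock A is monotonic, so a2 >= a1.
     */
    static void
    clockpair_poll(systime_t (*read_a)(void), systime_t (*read_b)(void),
        struct clockpair_sample *sp)
    {
        systime_t a1, b, a2;

        a1 = (*read_a)();
        b = (*read_b)();
        a2 = (*read_a)();

        /* Assume B was sampled midway between the two reads of A. */
        sp->offset = b - (a1 + (a2 - a1) / 2);
        sp->ambiguity = a2 - a1;
    }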
  It might also be possible to provide arithmetic functions computing nominal time intervals directly from tickstamps themselves, to make the implementation of things like CoDel's packet timestamping more economical.

- The fundamental problem that clock synchronization software needs to deal with is that the oscillator (likely a crystal soldered to a board somewhere) driving the clock the software is trying to keep correct makes errors: the output frequency of the oscillator usually differs considerably from the number written on its package and will measurably vary with time. The operating system will likely be setting the rate of advance of the digital clock being driven by the oscillator based on the number written on the oscillator's package; the job of clock synchronization software is to measure the actual frequency of the oscillator and to update the clock's rate of advance to match. The traditional BSD clock adjustment interface, adjtime(2), provides only a slewing phase (i.e. time) adjustment. There is no way to directly alter the clock's rate of advance to reduce the need for phase adjustments, nor is the adjtime(2) slew rate adjustable, and by modern standards the precision of the phase adjustments it implements is quite limited. The optional NTP adjustment interface does provide a frequency adjustment, but its resolution is lower than the measurement precision can be, and the adjustment is embedded in other code imposing implementation constraints which aren't always appropriate. The plan here would retain the older interfaces but add a new system call interface providing independent phase and frequency adjustments for each clock made visible at the interface, and allowing polled sampling of the relationship between pairs of those clocks. In addition, certain technical advantages are gained if the effect of clock adjustments is transparently predictable (a definition of this might be that when an adjustment is made it should be possible to precisely compute, from post-adjustment timestamps, the values of timestamps that would have been observed if the adjustment had not been made), so some effort will be given to ensuring that the arithmetic implementing an adjustment has a predictable effect which is accurately reported to the caller.

- I believe that a side effect of the above work will be to make it possible to provide system-call-less time of day library functions when the hardware underlying the system clock is accessible from user space (e.g. the CPU cycle counter).

- The current clock code somehow results in a clock which jitters by some 100's of nanoseconds even when the underlying hardware is much more precise than this and when no adjustments are being made. This becomes noticeable when hardware tickstamp peripherals are available to measure it. I haven't quite figured out where this comes from, but I think it has something to do with the manipulation of the conversion constants done at clock interrupt time. The changes to the clock adjustment scheme would make tick-to-time conversions "tickless", (mostly) eliminating the need for clock interrupt maintenance of conversion data.

I have a very long, and now somewhat out of date, writeup on how I think a clock adjustment interface should work, and why I think that. With luck it should be available here

        http://www.mistimed.com/home/Clock.pdf

but I can provide a summary of changes to save you the trouble of looking at that. Perhaps the least good thing is that it uses Yet Another Time Format internally, and at the adjustment interface.
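Purely for concreteness, here is a guess at the shape such a per-clock adjustment request might take, expressed with the systime_t and sysrate_t types described next; every name in this sketch is hypothetical rather than the proposed interface:

    #include <stdint.h>

    typedef uint64_t systime_t;     /* 32.32 seconds since some epoch */
    typedef int64_t sysrate_t;      /* signed fraction in [-0.5, 0.5) */

    /*
     * Hypothetical per-clock adjustment request, illustrating the
     * independent phase and frequency knobs plus a scheduled
     * effective time.  Not the proposal's actual ABI.
     */
    struct clk_adjustment {
        systime_t when;     /* future time at which this takes effect */
        systime_t step;     /* phase (offset) step, modular */
        sysrate_t rate;     /* change to the clock's rate of advance */
    };

A call of the general form clk_adjust(clockid, &adj, &result) could then report back the exact arithmetic effect of the adjustment, keeping adjustments transparently predictable in the sense described above.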
A systime_t is an unsigned 64 bit integer type holding a seconds-since-some-epoch value, with the high order 32 bits being an integral seconds count and the low order 32 bits being a fractional second. The decision to use this was arrived at only after trying to make use of existing time formats and discovering how much easier it was to write correct code if the time format behaves more or less like a regular integer type, without requiring simple operations like additions, subtractions and comparisons to be encapsulated in function calls.

The other important type which appears at the adjustment interface is a sysrate_t, a signed 64 bit integer type representing a fractional value in the range [-0.5, 0.5). This is used to pass changes to the rate of advance of a clock and to specify a slew rate and direction for slewing time offsets. The low order bit of the representation has a precision of about 5 * 10^-20, which is ample for all practical, and even impractical, purposes.

Internal to the kernel a timestamp passes through three states on the way to a time of day. A tickstamp_t is a 64 bit integer type filled in with as raw a sample from the hardware and its associated software state (if needed) as can be unambiguously converted to a timestamp in the not too distant future. At conversion time clock-dependent code is called to convert the tickstamp_t value to a tickcount_t, an unsigned 64 bit integer count of "ticks", ideally initialized to zero when the system is booted and advancing at the rate of the underlying counter. How this conversion is accomplished depends on the characteristics of the clock and what is convenient for the processor. If the hardware is a 64 bit counter and it is convenient to sample all 64 bits in the tickstamp_t then the conversion may do nothing; if the counter is less than 64 bits then the clock-dependent code may be called upon to supply the high order bits of the tickcount_t representation; while if the hardware is other than a free-running up-counter (say, a count-down interrupt timer) then the conversion may require some arithmetic.

The conversion from a tickcount_t `tc' to a systime_t `t' is done in clock-independent code with the following computation (all variables but maybe `s' are unsigned 64 bit integer values):

        t = (tc << s) * r + c

Here `s' is a constant initialized from the nominal rate of the clock, chosen so that bit 32 of the shifted counter value (tc << s) will change state at least once each second. The multiply by `r' returns the high order 64 bits of the 128 bit product; with a minimal value of `s' at least 63 bits of `r' will be significant. The values of `r' and `c' are manipulated to implement adjustments. A step offset adjustment would be implemented by modifying the value of `c', while a rate of advance adjustment would be made by modifying `r' with a corresponding change to `c' to maintain phase continuity. One additional feature, the ability to schedule an adjustment to occur at a precise future time, is used to implement slewed offset adjustments a la adjtime(2) (using a pair of rate of advance adjustments, one done `now' and another restoring the original `r' scheduled at the moment the requested offset will have accumulated) and for leap second support (a step or slewed offset adjustment scheduled at an appropriate time for the leap).
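A minimal sketch of that conversion, with names of my own choosing, using the GCC/Clang 128 bit type for the wide multiply on 64 bit machines (a portable construction is shown further below):

    #include <stdint.h>

    typedef uint64_t tickcount_t;
    typedef uint64_t systime_t;

    /* Conversion constants; change only when an adjustment is made. */
    struct scaling {
        uint64_t r;     /* rate multiplier, about 63 significant bits */
        uint64_t c;     /* phase constant */
    };

    /* High 64 bits of the unsigned 128 bit product.  Compilers on
     * 64 bit machines reduce this to one or two instructions. */
    static inline uint64_t
    umul64hi(uint64_t a, uint64_t b)
    {
        return (uint64_t)(((unsigned __int128)a * b) >> 64);
    }

    /* t = (tc << s) * r + c, all arithmetic modulo 2^64. */
    static inline systime_t
    tc_to_systime(tickcount_t tc, unsigned int s, const struct scaling *sc)
    {
        return umul64hi(tc << s, sc->r) + sc->c;
    }

    /*
     * Rate adjustment with phase continuity: install a new `r' and
     * recompute `c' so the conversion still passes through the
     * current (tc, t) point.
     */
    static inline void
    scaling_set_rate(struct scaling *sc, uint64_t new_r, tickcount_t now,
        unsigned int s)
    {
        systime_t t_now = tc_to_systime(now, s, sc);

        sc->r = new_r;
        sc->c = t_now - umul64hi(now << s, new_r);
    }

A step offset adjustment would just add the (modular) offset to sc->c.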
To enable the scheduling of future changes, and to retain history so that tickcount_t values which are converted late end up producing the same systime_t value that they would have had they been converted immediately, conversion constants are stored in a circular array of (tc, r, c) tuples, with the value of `tc' indicating the tickcount_t value at which the constants became valid for use. The conversion of a `tc' hence searches for constants starting at the most recent array entry, returning the constants from the first entry found with a `tc' no later than the target (a sketch of this lookup appears at the end of this description).

Note that, with the exception of dealing with a far corner case (the overflow of (tc << s)), no periodic maintenance of `r' and `c' is required. The implementation is "tickless": values of `r' and `c' only change when an adjustment operation explicitly changes them. Between adjustments the values are constants, and the precision of those constants, along with the direct linear relationship between `tc' and `t', should guarantee that the jitter of the systime_t result of the computation is identically that of the tickcount_t value it is computed from, for hardware counter frequencies less than 4 GHz. It is the case that the conversion from tickstamp_t to tickcount_t will likely require periodic maintenance if the tickstamp_t samples less than a full 64 bits of counter, but in the normal case that work will be to determine the high order bits of the tickcount_t representation, with the low bits still coming directly from the hardware. Errors made when implementing this should hence have large and noticeable effects rather than small and subtle ones.

This arrangement is not perfectly compatible with the current timecounter code, which essentially views the function of a hardware counter to be to measure time intervals between hardclock interrupts (the implication of the "Timecounters tick every 10.000 msec" message), but is not inconsistent with it. It should be possible to provide a cookbook recipe for converting a timecounter to the new, tickless arrangement with minimal thinking, particularly when the underlying hardware is a conventional free-running up-counter. The lack of a clock interrupt synchronization point common to all clocks will make changing the system clock source a bit more awkward, but the visibility of all clocks at the adjustment interface means that the code to do a synchronize-and-switch of the system clock need not reside in the kernel which, given that this operation is rarely required, seems appropriate.

While the scheme here provides the necessary visibility of IEEE 1588 counter clocks associated with Ethernet interfaces for the purposes of adjustment and time transfer, there is additional work, not yet done, needed to provide a framework for handling PTP tickstamps and software tickstamping (for NTP, and for PTP when hardware support is absent) in Ethernet drivers. The ultimate aim of the work would be a time synchronization application implementing both NTP and PTP, managing the multiple clocks and taking advantage of the precision of the adjustment interface to improve the quality of system timekeeping while minimizing the cost of doing so.

The first step in the implementation of this would be to provide the arithmetic functions needed to maintain the adjustment constants, making them available to both the kernel and to user space applications.
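Returning to the conversion-constant history described above, the tuple lookup might look something like this sketch (again with names of my own invention; synchronization with writers is elided):

    #include <stdint.h>

    typedef uint64_t tickcount_t;

    #define NSCALING    8       /* entries; must be a power of two */

    struct scaling_entry {
        tickcount_t tc;         /* constants valid from this tick on */
        uint64_t    r;
        uint64_t    c;
    };

    struct scaling_hist {
        unsigned int         latest;        /* index of newest entry */
        struct scaling_entry ent[NSCALING];
    };

    /*
     * Find the constants in effect at `tc': scan backwards from the
     * newest entry for the first whose tc is no later than the
     * target.  A future-dated entry (a scheduled adjustment or a
     * leap) is skipped naturally until its moment arrives.
     */
    static const struct scaling_entry *
    scaling_lookup(const struct scaling_hist *h, tickcount_t tc)
    {
        unsigned int i, n;

        for (i = h->latest, n = 0; n < NSCALING; i--, n++) {
            const struct scaling_entry *e = &h->ent[i & (NSCALING - 1)];

            if (e->tc <= tc)
                return e;
        }
        /* Older than all retained history; best effort. */
        return &h->ent[(h->latest + 1) & (NSCALING - 1)];
    }

Nothing in this lookup needs the clock interrupt; entries are written only when adjustments are made.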
In addition to the frequently used unsigned 64x64=(high order 64 bits) multiply, a 64x64=128 bit multiply and a 128/64=64%64 bit divide are needed, though much less frequently used. The library, described in the attached man page, also exposes the component operations from which the required 128 bit integer functions may be constructed, in case these are useful. A standard C implementation of the entire library is provided, but the header files are arranged to override these with machine-specific inline versions for any of the functions which can be implemented by the processor with a suitably short code/instruction sequence (all of them are inline for x86_64). All the 64 bit processors I've looked at have instructions to do the 128 bit multiplies, save sparc64. For the 32 bit machines I've had to run code on I've written assembly replacements of the 128 bit multiply functions, since having the use of the processor carry flag provides an advantage over the C implementation (a portable construction of the multiply is sketched below). The 128 bit divide is less important in terms of performance, and the C code would be difficult to improve upon in any case, though I did an assembly 64/32=32%32 bit divide function for arm, which generally lacks hardware divide instructions altogether.

I have a prototype implementation of the clock adjustment interface and have found it capable of producing rather excellent results at a relatively modest cost, given a quality time source. I have a custom clock implemented on a PCI-X card which can synchronize itself to the PPS and frequency outputs of a GPSDO receiver with a precision of about 3.1 ns. If I poll the TSC-based system clock against the card clock 4 times per second I find I can keep the system clock synchronized to the card clock with a frequency adjustment of roughly 10^-9 and a phase adjustment of roughly 10 ns made about once every 10 seconds (i.e. about one adjustment per 40 samples). If the card polling ambiguity (+/- 7 ns) is included, that's a system clock generally kept within 20 ns of GPS with an adjustment rate of once every 10 seconds.

Clearly an NTP-quality time source at NTP sample rates will produce results which are much worse than this, but maintaining a much sloppier clock has the advantage of requiring a much lower rate of adjustment (I think maybe once every 500 seconds for NTP if the machine's temperature is fairly constant). This contrasts with the current ntpd, which keeps the clock accurate with an adjustment call to adjtime() every second when the kernel code is excluded, or an adjustment every hardclock() interrupt when the kernel code is used; the low adjustment rate also provides the (cache behaviour) economy of shared data which is very frequently read but very seldom written.

Another advantage is that it replaces the current clock maintenance implementation, which essentially operates by measuring time intervals between hardclock() interrupts and hence is inextricably attached to hardclock(), with one which detaches clock-keeping functions from hardclock() altogether. If the timeout queue could be removed from hardclock() in favour of a tickless implementation as well then maybe there wouldn't be much left for hardclock() to do. Finally, it provides a structural framework for dealing with IEEE 1588 and other interesting tickstamping peripherals that is missing now.
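For illustration, the standard C construction of the 64x64=128 bit multiply from 32 bit halves, mentioned above, might look like the following. This is my sketch of the usual technique, not the library's actual code; the attached man page describes the real interface:

    #include <stdint.h>

    /*
     * Unsigned 64x64=128 bit multiply built from 32 bit half-words.
     * Returns the low 64 bits of the product and stores the high
     * 64 bits through `hip'.  None of the intermediate additions
     * can overflow, which is why carry-flag assembly does better.
     */
    static uint64_t
    umul64x64(uint64_t a, uint64_t b, uint64_t *hip)
    {
        uint64_t alo = a & 0xffffffffu, ahi = a >> 32;
        uint64_t blo = b & 0xffffffffu, bhi = b >> 32;
        uint64_t lo, t, tl;

        lo = alo * blo;
        t = ahi * blo + (lo >> 32);         /* cannot overflow */
        tl = alo * bhi + (t & 0xffffffffu); /* cannot overflow */

        *hip = ahi * bhi + (t >> 32) + (tl >> 32);
        return (tl << 32) | (lo & 0xffffffffu);
    }

The high-half multiply used by the tick-to-time conversion is just this with the low 64 bits discarded.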
I'm not quite sure of an implementation strategy (installing the math library is painless but the next step might not be), but I'm leaning towards trying to do it as a kernel compilation option, retaining the existing code, until such time as most architectures have had support work done and it has been shown that the new code is not worse than the old code. At that point there could be a switch over to the new code, followed by work on things like packet tickstamping which depend on having the newer infrastructure.

I'm not 100% confident that this will work quite as well as I think it should in all cases, but where I've managed to try it the results have been pretty good. I do think that finishing the work will minimally allow the combination of NetBSD + a BeagleBone Black (or a processor with similar peripherals) + a GPS timing receiver board to provide the most accurate NTP+PTP server that $100 can buy, and I'd like to have some of those. I realize reasonable people might differ on some of this, however.

Dennis Ferguson
Attachment: ulmath.3