Port-vax archive


Re: PSA: Clock drift and pkgin



On 2023-12-18 15:33, Maciej W. Rozycki wrote:
On Fri, 15 Dec 2023, Johnny Billquist wrote:

983136 is pretty close to 1000000. However, without looking at the code, isn't
that the diagnostics timecounter? Not used for anything really related to time
keeping, but just for some other information, like sampling the state of the
cpu and so on?

   It's just a free-running counter, as good as any.  The KA46 has no ICR.

Sort of. ICR and NICR are not required to exist. A machine is allowed to have a
subset implementation of ICCS, only capable of generating an interrupt every
10 ms with no further control. Which is still the normal clock used as a
source for time in the OS, if I'm not completely confused.

  As long as the ICR is used (or no high-resolution timer is available at
all) using timer interrupt as the system clock source is the correct
approach.  The thing is the ICR is synchronous to the timer interrupt, and
moreover it is not a free-running counter as it's reinitialised every time
an interrupt is produced.  So using the counter of interrupt ticks as the
high-order bits of the timekeeping timer is the only way you can produce a
monotonic counter.

Yes. But (and) the thing is - the ICCS is *always* available. And is always used. But when we don't have the ICR, reading out the clock becomes tricky. We are still using ICCS as the source of the clock interrupts that drive the system wall clock, but we no longer have ICR as a source of information on how much time has passed since the last clock interrupt. Basically, when we read out time, we call getticks(), and then add the normalized current value of ICR to get the current time. So if ICR is 0 all the time, we would basically just have a time that is getticks() with nothing more, with a resolution of 10ms. But it should for sure be monotonically increasing.
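
The readout described above can be sketched roughly as follows. This is only an illustration, not the NetBSD code: read_time_usec() and its parameters are hypothetical names, HZ is assumed to be 100 (the 10 ms tick mentioned in this thread), and the ICR value is assumed to be already normalized to microseconds since the last clock interrupt.

```c
#include <assert.h>
#include <stdint.h>

#define HZ            100               /* assumed: 100 clock interrupts/s */
#define USEC_PER_TICK (1000000 / HZ)    /* 10 ms expressed in microseconds */

/*
 * Hypothetical helper, not a NetBSD function.
 *
 * ticks:    the interrupt counter, i.e. what getticks() would return
 * icr_usec: microseconds elapsed inside the current tick, already
 *           normalized to 0 .. USEC_PER_TICK - 1; a machine without
 *           an ICR would always pass 0 here
 */
static uint64_t
read_time_usec(uint32_t ticks, uint32_t icr_usec)
{
	assert(icr_usec < USEC_PER_TICK);
	return (uint64_t)ticks * USEC_PER_TICK + icr_usec;
}
```

With icr_usec stuck at 0 the result only moves in 10 ms steps, but it never goes backwards as long as the tick counter itself does not.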

For the KA46 (and *only* the KA46), we are using some other mechanism, which I haven't really dug into, to get some higher precision time when reading out time. But let's ignore that platform for the moment. We have people with various machines, and simulations, which have the time problem. And most are not KA46. As far as I see, KA46 is merely the 4000/60.

  The drawback is that if you ever lose even a single timer interrupt, then
you lose track of the wall clock too.

Certainly. Which is why the question of lost interrupts was brought up.
But it is definitely the case that this is how time is tracked. Definitely on VAX. I would think for all other platforms as well, but I haven't looked at them.

So in the end, what we have is that for most machines, we're getting a higher
resolution clock based on ICR. We basically have the clock tick, which gives
something at 10 ms steps, and then we add in what ICR is at the moment. For
CPUs that don't have an ICR, the clock will just be at the 10 ms resolution
and that's it.
With the exception of the VAX_BTYP_46, which uses another source, that is.

  And we do want to use such another source where no ICR is available as
10ms resolution is pretty horrible for the purpose of timekeeping.  In
that case the other source is not synchronous to the timer interrupt and
therefore the OS ought not use the timer interrupt as the system clock
source.  Instead it should use the approach I outlined previously, that is
use the high-precision timer as the system clock source and only use the
timer interrupt to keep track of timer overflows.
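
One way to picture the approach quoted above is a free-running hardware counter that wraps, widened in software to 64 bits. The sketch below assumes a 32-bit counter; struct wide_counter and wide_counter_update() are hypothetical names and this is not the KA46 or pmax code. The only requirement is that the update runs at least once per wrap period, for example from the timer interrupt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-bit software extension of a wrapping 32-bit counter. */
struct wide_counter {
	uint64_t total;   /* counts accumulated so far */
	uint32_t last;    /* raw hardware value seen at the previous update */
};

/*
 * Fold a fresh raw reading into the 64-bit total and return it.
 * Unsigned 32-bit subtraction yields the correct delta even when the
 * hardware counter has wrapped once since the previous call.
 */
static uint64_t
wide_counter_update(struct wide_counter *wc, uint32_t raw)
{
	assert(wc != NULL);
	wc->total += (uint32_t)(raw - wc->last);
	wc->last = raw;
	return wc->total;
}
```

Because time comes from the counter deltas rather than from counting interrupts, a late or missed timer interrupt costs nothing as long as another update happens before the counter wraps a second time.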

I should point out that even when we don't have the ICR register, we are running with a 10ms precision clock as far as interrupts are concerned. And that is where the system clock source comes from.

Basically, we have an interrupt vector at C0 (IPL 16), which points at hardclock. All in arch/vax/vax/intvec.S.

The hardclock routine in turn calls the C function hardclock() in kern/kern_clock.c, which is expected to be called HZ times per second and which, when it is running on the primary cpu, deals with the wall clock (hardclock_ticks).

Note that if anyone calls getticks() in the kernel, they will get the hardclock_ticks value, which is basically just the counter of interrupts calling hardclock().
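
As a rough model of that relationship (not the kernel source; every sketch_* name is made up for illustration and HZ is assumed to be 100):

```c
#include <assert.h>
#include <stdint.h>

#define HZ 100                            /* assumed tick rate */

static volatile uint32_t sketch_ticks;    /* stand-in for hardclock_ticks */

/* Stand-in for the per-interrupt work: one more tick has been seen. */
static void
sketch_hardclock(void)
{
	sketch_ticks++;
}

/* Stand-in for getticks(): just report the interrupt counter. */
static uint32_t
sketch_getticks(void)
{
	return sketch_ticks;
}

/* Convert a tick count to milliseconds (10 ms per tick at HZ = 100). */
static uint32_t
ticks_to_ms(uint32_t t)
{
	assert(1000 % HZ == 0);
	return t * (1000 / HZ);
}
```

In this model an interrupt that is never delivered is simply never counted, so any time derived from the counter falls 10 ms further behind for each lost tick.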

  Apart from providing correct time (which will not be the case in this
scenario if you try to treat the counter of timer interrupts as the
high-order bits of the timekeeping timer), the advantage of this approach
is that as long as at least one timer interrupt has been handled between
high-precision timer overflows no track loss of the wall clock will happen
(of course we're not supposed to lose timer interrupts anyway, but the
consequences of missing a preemptive context switch are certainly less
severe in a non-RTOS than getting out of sync with time).

Well. Maybe part of the problem is that VAX is actually using a clock interrupt for counting time. I wasn't even aware that NetBSD could run in tickless mode. There are a lot of things that usually are driven by the ticks. Not just preemptive context switching.

  The KA46 hardware configuration is analogous to the KN03/3MAX+ machine,
where the source of the timer interrupt is the DS1287 RTC chip and the
high-precision timer is located in the TURBOchannel bridge chip.  This is
handled as "turbochannel_counter" in sys/arch/pmax/pmax/dec_3maxplus.c,
and the KA46 variant ought to work essentially the same.  It is actually
the 3MAX+ machine that David L. Mills used to implement his NTP framework.

The DS1287 should never be a source of any precision time as far as I know. It has a resolution of 1s. It's usually used as the calendar chip, from which you set the wall clock on boot, but which you otherwise usually never bother with.

Yes, it can generate interrupts as well, with a fairly high frequency, but I can't see a way of reading out any high precision time from it.

But anyway - our timing problems clearly occur on machines with no DS1287, and with ICR, as well as all other combinations. And even the 4000/60 is using getticks() sourced from the ICCS register as the starting point, and then it just uses some other information to get some more precision, since the ICR doesn't exist. (We should probably look at extending that to more machines, because if the 4000/60 doesn't have an ICR, then it's likely that the same is true for all 4000 machines...)

  NB I disagree that 983136Hz is pretty close to 1000000Hz.  The frequency
difference implies a ~1 second drift per 1 minute, which I find pretty
horrible by any measure.
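
The arithmetic behind that figure can be checked with a few lines of C. This is just a worked calculation under the assumption that a counter really running at 983136 Hz is treated as if it ran at 1000000 Hz; drift_seconds() is a hypothetical helper name.

```c
#include <assert.h>

/*
 * Seconds of error accumulated over interval_s real seconds when a
 * counter ticking at actual_hz is interpreted as ticking at assumed_hz.
 * A positive result means the derived clock falls behind.
 */
static double
drift_seconds(double actual_hz, double assumed_hz, double interval_s)
{
	assert(assumed_hz > 0.0);
	return (assumed_hz - actual_hz) / assumed_hz * interval_s;
}
```

For drift_seconds(983136.0, 1000000.0, 60.0) the relative error is 16864 / 1000000, or about 1.69%, which over 60 seconds comes to roughly 1.01 seconds.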

:-)
I said that a little with tongue in cheek. But also, I'm not entirely sure how the value is used. I can see some computations on the KA46 to work out a high precision time, which are not simple copies of values. So whether some scaling is applied one way or another to some of the values here, I'm not sure. But as I said, I'm not even going to sort this one out right now. Keeping it simple, and starting with machines that don't even deal with that hardware. It seems we still have something seriously wrong on hardware that should not have this problem (but I really should check that simh isn't doing the ICR wrong).

  Johnny

--
Johnny Billquist                  || "I'm on a bus
                                  ||  on a psychedelic trip
email: bqt%softjar.se@localhost             ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol

