Port-vax archive


Re: PSA: Clock drift and pkgin



On 2023-12-18 18:50, Maciej W. Rozycki wrote:
On Mon, 18 Dec 2023, Johnny Billquist wrote:

   As long as the ICR is used (or no high-resolution timer is available at
all) using timer interrupt as the system clock source is the correct
approach.  The thing is the ICR is synchronous to the timer interrupt, and
moreover it is not a free-running counter as it's reinitialised every time
an interrupt is produced.  So using the counter of interrupt ticks as the
high-order bits of the timekeeping timer is the only way you can produce a
monotonic counter.
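
A minimal sketch of the combination described above, with the interrupt tick count as the high-order part and the ICR as the low-order part. It is not the NetBSD code. The name mono_usec, the TICK_US constant and the intr_pending parameter are hypothetical, and it assumes a 100Hz clock interrupt with an ICR that counts up in 1us steps from -10000 and is reloaded on overflow:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TICK_US 10000   /* assumed: 100Hz clock interrupt, 1us per ICR step */

/*
 * Hypothetical helper: build a microsecond count from the number of clock
 * interrupts seen so far (high-order part) and the current ICR value
 * (low-order part).
 */
static uint64_t
mono_usec(uint64_t ticks, int32_t icr, bool intr_pending)
{
    /* Assumed ICR range between reloads: -TICK_US .. -1. */
    assert(icr >= -TICK_US && icr < 0);

    /* Microseconds elapsed since the last reload: 0 .. TICK_US - 1. */
    uint32_t frac = (uint32_t)(icr + TICK_US);

    /*
     * If the ICR has already been reloaded but the interrupt has not yet
     * been counted, account for that tick here so the result does not
     * step backwards.
     */
    if (intr_pending)
        ticks++;

    return ticks * TICK_US + frac;
}
```

For example, mono_usec(3, -7500, false) gives 32500. A real implementation would also have to read the tick count, the ICR and the pending state consistently with respect to each other, which this sketch does not attempt.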

Yes. But (and) the thing is - the ICCS is *always* available. And is always
used.
But when we don't have the ICR, the value from reading out the clock becomes
tricky. Because we are still using ICCS as the source of clock interrupts that
drive the system wall clock. But then we don't have ICR as a source of
information on how much time has passed since the last clock interrupt.

  Which is exactly why I keep writing we ought not be doing this (if we do)
and should rely solely on the high-resolution timer for timekeeping.

No real disagreement there, except that if you mean something like the DS1287, then I'm not sure I agree on the definition of "high resolution timer". The ICCS in combination with the ICR gives a resolution of 1us, which obviously is way better than the RTC chip. In addition, with the RTC chip you really depend on no interrupts being missed, while the interrupts happen at a pretty high frequency, which increases the chance that one is missed.

But if we don't have ICR, then I agree that things are a bit depressing, as a 10ms resolution isn't exactly impressive. Again, the 4000/60 is doing something else/more to get a higher resolution. So clearly there is an attempt at doing something better in this case.
But for machines that do have the ICR, that seems to me to be superior.

For the KA46 (and *only* the KA46), we are using some other mechanism, which I
haven't really dug into, to get some higher precision time when reading out
time. But let's ignore that platform for the moment. We have people with
various machines, and simulations, which have the time problem. And most are
not KA46. As far as I can see, KA46 is merely the 4000/60.

  Conversely what I have been concerned with is incorrect operation with
actual hardware, and then this specific one.

Did anyone actually report results from running on a 4000/60? Personally, I've been running on a 4000/90, where time is not working well. But that one would be depending on the ICR, and I haven't checked if it actually does have a proper ICR. Does anyone know?

  As it has been mentioned in the discussion already getting timekeeping
right in simulation is tricky.  It is best handled by referring to the
host system clock where feasible, e.g. any hardware counters are best
evaluated at access time only and calculated based on the host clock and
the rate they are expected to change.
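
A minimal sketch of that idea for an emulator, assuming an up-counting interval counter like the ICR with a 1us step. The function name and parameters are hypothetical and this is not how simh is written. The counter is never stepped; its value is computed from host time whenever the guest reads it:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical emulator helper: the value a guest would read from an
 * up-counting interval counter, derived from the host clock at access
 * time.
 *
 * host_now_ns   - current host time in nanoseconds
 * host_epoch_ns - host time at which the guest started the counter
 * period_us     - guest timer period in microseconds (e.g. 10000)
 */
static int32_t
sim_icr_read(uint64_t host_now_ns, uint64_t host_epoch_ns, uint32_t period_us)
{
    assert(period_us > 0);
    assert(host_now_ns >= host_epoch_ns);

    /* Whole microseconds of host time since the counter was started. */
    uint64_t elapsed_us = (host_now_ns - host_epoch_ns) / 1000;

    /* Position inside the current period, mapped to -period_us .. -1. */
    return (int32_t)(elapsed_us % period_us) - (int32_t)period_us;
}
```

With a 10000us period, a read 2.5ms after the start and a read 12.5ms after the start both return -7500, as the counter has wrapped once in between. The number of interrupts that should have been delivered by then would be elapsed_us / period_us.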

Indeed. And I do not feel totally confident that simh is working right. But the fact is that real hardware is also having problems keeping time with NetBSD, so it seems clear there is a real problem here anyway.
But this needs more testing/troubleshooting/diagnosing.

   And we do want to use such another source where no ICR is available as
10ms resolution is pretty horrible for the purpose of timekeeping.  In
that case the other source is not synchronous to the timer interrupt and
therefore the OS ought not use the timer interrupt as the system clock
source.  Instead it should use the approach I outlined previously, that is
use the high-precision timer as the system clock source and only use the
timer interrupt to keep track of timer overflows.
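
A minimal sketch of that approach, assuming a free-running hardware counter narrower than 64 bits. The struct and function names are hypothetical. Time comes only from the counter; the periodic call (for instance from the timer interrupt) exists just to make sure no wrap-around goes unnoticed:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state for extending a narrow free-running counter. */
struct wide_counter {
    uint64_t total;  /* accumulated counter steps */
    uint32_t last;   /* raw value seen at the previous update */
    uint32_t mask;   /* counter width, e.g. 0xFFFF for 16 bits */
};

/*
 * Fold a new raw reading into the 64-bit total. This must be called at
 * least once per counter period, otherwise a full wrap goes unseen.
 */
static uint64_t
wide_counter_update(struct wide_counter *wc, uint32_t raw)
{
    assert((raw & ~wc->mask) == 0);

    /* Modular subtraction gives the right delta across a wrap. */
    uint32_t delta = (raw - wc->last) & wc->mask;

    wc->total += delta;
    wc->last = raw;
    return wc->total;
}
```

With a 16-bit counter, readings of 0xFFF0 followed by 0x0010 advance the total by 0x20, so a lost or late timer interrupt costs nothing as long as an update still happens within one counter period.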

I should point out that even when we don't have the ICR register, we are
running with a 10ms precision clock as far as interrupts are concerned. And
that is where the system clock source comes from.

  No question as to using the ICCS for timer interrupts, but I maintain
that where a separate high-resolution timer is available, it is the
high-resolution timer that ought to be used for the system clock.  And I
suspect it is already the case as AFAICT our clock handling subsystem is
generic across ports (as it ought to be too).

For almost any VAX, the high-resolution time comes from reading the ICR. Looking at the code, it seems in general to be reasonable, but I do have some questions about it. But clearly machines are having issues, which leads to the suspicion that something is broken in that bit of code anyway.

  Yes, I realise timer interrupts have their use beyond just scheduling
(e.g. handling interval timer triggers, poll(2) timeouts, etc.), but none
of this stuff is supposed to rely on exact timing.

I was just pointing out that the timer interrupt is essential, as such.
However, the time problems we are observing mean that something is not done/working right. No matter what/why, it is something that needs fixing.

The DS1287 should never be a source of any precision time as far as I know. It
has a resolution of 1s. It's usually used as the calendar chip, from which you
set the wall clock on boot, but which you otherwise usually never bother with.

Yes, it can generate interrupts as well, with a fairly high frequency, but I
can't see a way of reading out any high precision time from it.

  For some machines, such as the KN01/PMAX or KN02/3MAX, the DS1287 chip is
the only timer interrupt source, and the resolution it can be programmed
to is up to 8192Hz/122us, much better than bare ICCS without ICR, though
with these slow machines 128Hz/7.812ms is typically used (for Alpha boxes
1024Hz/977us seems more suitable, for more fine-grained task scheduling
while still not wasting proportionally as much time in interrupt handling
overhead).  I've spent a lot of time working with these devices.
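
The rates quoted above translate into tick periods, and into the drift caused by lost interrupts, with simple arithmetic. A minimal sketch follows, with hypothetical function names and integer nanoseconds rounded down:

```c
#include <assert.h>
#include <stdint.h>

/* Tick period in nanoseconds for a given interrupt rate in Hz. */
static uint64_t
tick_period_ns(uint32_t hz)
{
    assert(hz > 0);
    return 1000000000ULL / hz;
}

/*
 * Time lost by a clock that only counts interrupts, if 'lost' of them
 * were never serviced.
 */
static uint64_t
drift_ns(uint32_t hz, uint64_t lost)
{
    return lost * tick_period_ns(hz);
}
```

This gives 122070ns at 8192Hz, 7812500ns at 128Hz and 976562ns at 1024Hz, against 10000000ns for a bare 100Hz clock interrupt. A clock that only counts interrupts falls behind by one such period for every interrupt it misses.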

I can agree that it's better than a VAX with only ICCS and no ICR. But it certainly suffers from the same problem/risk that lost interrupts will cause clock drift. I also think that ICCS in combination with ICR is better than that clock chip. But that doesn't help much, since we apparently have some kind of issue anyway.

So, from my point of view, we don't really need to argue the details of the implementation much. Unless you really want to, I don't think that's the main issue right now. What matters is that something is wrong, and it has been wrong for a long time. I've not really paid much attention to it, but I'm pretty sure it's been this bad for at least 15 years, if not more.

  Johnny

--
Johnny Billquist                  || "I'm on a bus
                                  ||  on a psychedelic trip
email: bqt%softjar.se@localhost             ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol

