Port-vax archive


Re: PSA: Clock drift and pkgin



>> How well can we trust your emulator to be fielding the correct
>> number of interrupts?
> There are lots of places there that ticks could get lost.  That's why
> I want to instrument everything.

I looked at adding this, only to discover I'd already done so.

Teal deer: I need to dig into this more.  Everything is puzzling.

The code has a way I can have it generate a log line once a second
which records the time (according to the host OS) and all five timer
stats.  Using internal variable names, they are ("lost in" is explained
below):

timer_ticks
	The number of times the main process has processed info pushed
	up from the auxiliary process.  (In the total absence of loss,
	this will increase by 100 per second.)
timer_loss
	Ticks lost in the auxiliary -> master channel.
timer_loss_2
	Ticks lost in the kernel -> auxiliary channel.
iccs_intrs
	The number of ICCS interrupts taken by the emulated CPU.
iccs_loss
	Ticks lost in ICCS interrupts.

"Lost in" refers to something happening and then happening again before
the first one is handled.  This can happen at three places: the kernel
pushing ticks to the auxiliary process; the auxiliary process pushing
ticks to the master process; and the emulated machine taking ICCS
interrupts.

I tested it both under no load (up single-user, sitting at the shell
prompt) and under load (my test load was "fsck -f /dev/rra0a" in the
single-user shell).

All three are losing ticks.  Over 372.754570 seconds of testing, with
about the first 1/6 and the last 1/6 sitting idle and the middle 2/3
running fsck:

- timer_loss increases at a rate which is irregular in the fine but
   relatively steady in the gross.  Total loss 478 ticks.

- timer_loss_2 increases similarly, but during about the last 60% of
   the fsck run, it was losing less than at other times.  Total loss
   211 ticks.

- iccs_loss held steady when idle but grew steadily when loaded.
   Total loss 229 ticks; about 2/3 of the test time was loaded.

All three of those surprise me.

Furthermore, looking at the logged lines, I see (to pick the second and
second-last lines)

STAMP: 2023-12-14 07:51:51.439154 (tick 101 (2 0), intr 0 (0))
STAMP: 2023-12-14 07:58:02.193888 (tick 36701 (477 210), intr 36187 (229))

which indicate that it registered 36600 ticks in 370.754734 seconds,
meaning it lost about 1.28% of its ticks.  This is _very_ disturbing;
timer sockets are not supposed to do that.

I did another run, using SIGALRM instead of timer sockets as the timing
source.  This is not usable for general-purpose use, because 5.2
SIGALRM is slightly broken; requesting SIGALRM at 100Hz actually
delivers them at 50Hz (kernel HZ is 100).  I haven't dug into why that
is - for most of my purposes, timer sockets are a more useful
interface.

Except for a startup transient, this form loses ticks but doesn't
notice it.  Timekeeping was running at half speed, because of the 50Hz
SIGALRM issue, but aside from that it was still losing time.

At 08:42:07.503420 it had seen 101 ticks and lost 469 (timer_loss) + 1
(iccs_loss).  timer_loss_2 is always zero in this mode, and timer_loss
instead means the number of SIGALRMs that got delivered by the kernel
but didn't get handled by the emulator before the next SIGALRM (see
below for details).  I count the 469 as a startup transient.

At 08:48:09.273952, after roughly the same load pattern as for the
other test, it had seen 18101 ticks and lost 470 + 1.

But, after 18000 ticks (the difference between those two lines), it
should have taken precisely three minutes - or six minutes, because
it's running at half speed.  It didn't.  It took six minutes 1.770532
seconds.  So it's losing ticks somewhere it can't notice.  The only
place I can think of that could be is SIGALRM delivery.  But it never
blocks SIGALRM, and the handler is

volatile unsigned int got_alrm;
volatile int attention;

static void timer_tick(int sig __attribute__((__unused__)))
{
 got_alrm ++;
 attention = 1;
}

which is handled in step1(), which is "execute one instruction":

unsigned int prev_alrm;

static void step1(void)
{
 ...variable declarations...
 if (attention)
  { attention = 0;
    if (timerstyle == TS_SIGALRM)
     { int v1;
       int v2;
       v2 = got_alrm;
       do
        { v1 = v2;
          v2 = got_alrm;
        } while (v1 != v2);
       if (v1 != prev_alrm)
        { timer_loss += (v1 - prev_alrm) - 1;
          timer_ticks ++;
          handle_tick();
          prev_alrm = v1;
        }
     }
    ...check other things that can set attention...
  }
 ...execute one instruction...
}

(The loop when reading got_alrm is paranoia; I think on all relevant
hardware writing an unsigned int is atomic.)

I'm clearly going to need to dig more.

I also should make v1 and v2 unsigned int, but that doesn't explain
this; wraparound won't happen there unless a single run lasts multiple
months.

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML		mouse%rodents-montreal.org@localhost
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

