port-alpha: "frequency error ... exceeeds tolerance" again

Subject: "frequency error ... exceeeds tolerance" again
To: None <port-alpha@NetBSD.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: port-alpha
Date: 08/27/2007 23:34:40
This PWS's NTP is still misbehaving.  I don't know what to try next.
(But there's some very interesting stuff going on here; see the part at
the end about what happened after I killed off ntpd....)

I tried setting HZ to 1024.  I tried redefining CLKF_BASEPRI to 0.  I
tried both of them together.

Nothing seems to work.  The latest run - with both of those changes -
still misbehaved; you can find the self-monitoring log, which consists
of once-a-minute runs of

date "++%%s=%s %F %T"; ntpdc -c kerninfo; ntptime; ntpdate -q 10.0.1.1; echo ""

(10.0.1.1 being the broadcast time provider on that subnet) in
ftp.rodents.montreal.qc.ca:/mouse/misc/ntp-log.txt.  I processed it
with

awk '/^[+]%s=/ { t = substr($1,5); } /^  offset/ { o = $2; f = $5; } /^  maximum error/ { e = $7; } /^$/ { print t " " o " " f " " e; } { next; }'

to pick out three of the most interesting values for each data sample
(the offset, the frequency error, and the "estimated error") and
plotted them with gnuplot, and it ticked along fine for a while and
then went wonky, with all three values jumping all over the map for
hours.

Interestingly, the estimated error started weirding out well before
either of the others did, as you can see in a plot of all three values.
(You probably want to multiply the frequency error by 100 or so if
plotting on the same scale as the others; in gnuplot, I do `f(x)=x*100'
and then `plot "z" using 1:2 with lines, "z" thru f(x) using 1:3 with
lines, "z" using 1:4 with lines'.)

So, I killed off ntpd and started running ntpdate -B instead.

Interestingly, the measured offset *still* jumps all over the place!
Look at these runs, which were all done within the same minute or so:

[Omega - root] 39> ntpdate -B 10.0.1.1
28 Aug 00:04:26 ntpdate[25214]: adjust time server 10.0.1.1 offset 0.198104 sec
[Omega - root] 40> ntpdate -B 10.0.1.1
28 Aug 00:04:34 ntpdate[24674]: adjust time server 10.0.1.1 offset -0.020164 sec
[Omega - root] 41> ntpdate -B 10.0.1.1
28 Aug 00:04:38 ntpdate[926]: adjust time server 10.0.1.1 offset -0.007026 sec
[Omega - root] 42> ntpdate -B 10.0.1.1
28 Aug 00:04:41 ntpdate[24736]: adjust time server 10.0.1.1 offset 0.002804 sec
[Omega - root] 43> ntpdate -B 10.0.1.1
28 Aug 00:04:43 ntpdate[24739]: adjust time server 10.0.1.1 offset -0.001927 sec
[Omega - root] 44> ntpdate -B 10.0.1.1
28 Aug 00:04:46 ntpdate[2847]: adjust time server 10.0.1.1 offset 0.001993 sec
[Omega - root] 45> ntpdate -B 10.0.1.1
28 Aug 00:04:47 ntpdate[2869]: adjust time server 10.0.1.1 offset 0.290963 sec
[Omega - root] 46> 

So unless 10.0.1.1 itself has weirded out - which is unlikely; it
serves time to the rest of my house LAN as well, and no other machine
has noticed anything, including three others, also on this same subnet,
running 3.1 - then *something* is rather seriously busted on alpha.  At
this point I'm wondering if maybe it's in the NTP code rather than the
kernel, but I'm really not competent to more than speculate on that.

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B