Subject: Re: "frequency error ... exceeds tolerance"
To: der Mouse <mouse@Rodents.Montreal.QC.CA>
From: Greg Troxel <gdt@ir.bbn.com>
List: port-alpha
Date: 08/21/2007 12:11:42
  [Greg Troxel]
  > The real question is whether the clock is consistently that slow
  > (actually fast - I think that's the correction rate), or badly
  > behaved.

  That's a good question.  The number is consistently in the range
  500-512, except its sign flips back and forth (which I hadn't noticed
  until just now).  To me, this indicates that the clock is very badly
  behaved, and just sometimes happens to misbehave badly enough to pass
  the NTP limit one way or the other.  Does that sound like a correct
  interpretation?

Maybe, but things are messy enough that I'd be wary of any conclusion.
NTP on the wire uses various fixed-point formats, each designed to be
just big enough for the need.  The kernel PLL has the same mentality.
I wouldn't be all that surprised if something were wrapping.  500 ppm
really is wacky; normally even 100 ppm is bad.  See
/usr/include/sys/timex.h.
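
To make the wrapping worry concrete, here is a minimal sketch of what
happens when a frequency in ppm is stored in a signed fixed-point field
that is too narrow.  The 26-bit width with 16 fractional bits used in
the example call is purely hypothetical, picked so the field covers
[-512, 512) ppm; it is not claimed to be the layout of any real kernel
or NTP field, and wrap_fixed_ppm is an illustrative name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Store a frequency (in ppm) in a signed fixed-point field that is
 * total_bits wide with frac_bits fractional bits, wrapping on overflow
 * the way two's-complement truncation would, and return the value that
 * would be read back.  The field layout is HYPOTHETICAL.
 */
double wrap_fixed_ppm(double ppm, int total_bits, int frac_bits)
{
    assert(frac_bits >= 0 && total_bits > frac_bits && total_bits < 62);

    double scale = (double)((int64_t)1 << frac_bits);
    int64_t modulus = (int64_t)1 << total_bits;
    int64_t half = modulus / 2;

    /* Round to the nearest representable step without needing libm. */
    double x = ppm * scale;
    int64_t raw = (int64_t)(x >= 0.0 ? x + 0.5 : x - 0.5);

    /* Reduce into [-half, half); the double % keeps the result
     * non-negative even when raw + half is negative. */
    int64_t wrapped = ((raw + half) % modulus + modulus) % modulus - half;

    return (double)wrapped / scale;
}
```

With that hypothetical layout, wrap_fixed_ppm(520.0, 26, 16) reads
back as -504.0, while 100.0 survives unchanged.  A value just past the
top of the range coming back with the opposite sign and a magnitude
near 500 is at least the kind of symptom wrapping would produce.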

I'd run /usr/sbin/ntptime and see what that says.
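
For reference, the frequency that ntptime prints comes from the kernel
via ntp_adjtime(2).  A rough sketch of reading it directly, assuming
the freq field of struct timex is in "scaled ppm" with 16 fractional
bits (check sys/timex.h on the machine in question); the two function
names are made up for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/timex.h>

/* Convert scaled ppm (assumed 16 fractional bits) to plain ppm. */
double scaled_ppm_to_ppm(long scaled)
{
    return (double)scaled / 65536.0;
}

/*
 * Read the kernel's current frequency correction without changing
 * anything (modes == 0 is a read-only query).  Returns the clock state
 * reported by ntp_adjtime(), or -1 on failure.
 */
int read_kernel_freq_ppm(double *ppm_out)
{
    struct timex tx = {0};
    int state;

    assert(ppm_out != NULL);

    tx.modes = 0;                  /* query only, adjust nothing */
    state = ntp_adjtime(&tx);
    if (state == -1)
        return -1;

    *ppm_out = scaled_ppm_to_ppm(tx.freq);
    return state;
}
```

Calling read_kernel_freq_ppm() from a small main() and printing the
result should give roughly the frequency figure that ntptime reports.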

  > I would suggest upping the limit and letting it stabilize.

  What does "upping the limit" mean here?  Rebuilding NTP with
  NTP_MAXFREQ set higher?  Rebuilding the kernel with a higher adjustment
  slew rate?

I meant changing the threshold in the 'too far out of bounds' if test,
so that the algorithm is allowed to run.  But I'm not so sure that's a
good idea because, as you mentioned, the limit runs pretty deep into
the kernel.

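The other experiment I'd try is to stop running ntpd on the machine and
instead periodically run something (ntptrace will work, albeit
kludgily) to measure the offset to another machine.  If the offset
grows at a steady rate, the clock is consistently off; if it wanders,
the clock is badly behaved.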
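
Turning two such offset readings into a drift rate is simple
arithmetic.  A small sketch, assuming offsets and timestamps are both
in seconds and come from whatever tool took the measurements (drift_ppm
is just an illustrative name):

```c
#include <assert.h>

/*
 * Average drift rate, in ppm, between two offset measurements against
 * a reference machine.  offset*_s are the measured offsets in seconds;
 * t*_s are the times (in seconds) at which they were taken.  The sign
 * of the result follows the sign convention of the offsets.
 */
double drift_ppm(double offset1_s, double t1_s,
                 double offset2_s, double t2_s)
{
    assert(t2_s != t1_s);          /* need two distinct sample times */

    return (offset2_s - offset1_s) * 1e6 / (t2_s - t1_s);
}
```

For example, an offset that grows by 0.5 s over 1000 s works out to
500 ppm, which is right at the limit being discussed.  Repeating the
calculation over successive intervals shows whether the rate is steady
or jumping around.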

I dimly recall some bug on some architecture, maybe even alpha, 10 years
ago or so, where the clock code was just off, in a 1023/1024 kind of
way.
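
For scale, a 1023/1024 kind of error is large compared to the numbers
above.  A quick sketch of the arithmetic (ratio_error_ppm is an
illustrative name, and the sign assumes the clock ticks at num/den of
the true rate):

```c
#include <assert.h>

/*
 * Frequency error, in ppm, of a clock that advances num units for
 * every den units of true time.  Negative means the clock runs slow.
 */
double ratio_error_ppm(double num, double den)
{
    assert(den != 0.0);

    return (num - den) * 1e6 / den;
}
```

ratio_error_ppm(1023.0, 1024.0) comes to -976.5625, i.e. nearly
1000 ppm, about twice the 500 ppm tolerance.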