Clock drift and other open issues: Collecting information

To: port-vax%NetBSD.org@localhost, Johnny Billquist <bqt%softjar.se@localhost>, Anders Magnusson <ragge%tethuvudet.se@localhost>, "Maciej W. Rozycki" <macro%orcam.me.uk@localhost>
Subject: Clock drift and other open issues: Collecting information
From: Jan-Benedict Glaw <jbglaw%lug-owl.de@localhost>
Date: Tue, 26 Dec 2023 23:29:47 +0100

Hi!

As there is an amazing amount of work being done on and for VAX these
days, I'd like to collect known issues / TODOs and debugging details.
Maybe we could add a page to the NetBSD wiki[1]? (How do people get edit
access?) Right now, I see these major topics:

  * Document the swap / ulimit requirements for a successful local
    build, for some common real VAX systems as well as for a SIMH VAX
    equipped with 512 MB.

  * Get GCC 12 up'n'running. (Untested, maybe Kalvis already has some
    patches. Whatever we find should be upstreamed! That's true also
    for the Binutils bits. I think VAX's native 64bit support didn't
    yet arrive?)

  * Get pkgsrc's current Python up'n'running. (VAX FP got removed,
    needs to be added back and maybe a maintainer needs to step up?)

  * Fix timekeeping issues.


Esp. for the timekeeping issues, I've been testing a lot with a
4000/90 (I falsely claimed this system to be a /96, but that was
wrong---my /96 has a dead Dallas clock chip and is waiting for a
repair) and a 4000/60.

  My findings so far are that both systems, booted with a GENERIC
kernel, behave quite the same:

  * Both ntpd and ntpdate are disabled.
  * No networking.
  * Booting off local emulated SCSI disks (PiSCSI), installed locally
    from PiSCSI-emulated install ISOs.
  * Both systems lose about 2 to 4 seconds per day.
  * This loss does not change, regardless of whether
      * the system is idle; or
      * the system is CPU-loaded (running GCC in a loop); or
      * the system has I/O load (`cat`ting all regular files
        to /dev/null in a loop, with the FS being on the
        PiSCSI-emulated disk.)
  * So both the 4000/60 and the /90 keep reasonably stable time.
  * Booted with a slightly older image (Jun 6th), I see no unusual
    timekeeping-related messages; booting with a more recent image
    (g:33d45195d8dbc05843af2d76d66a83970b802c30, Fri Dec 22 17:55:49
    2023 +0000), I seem to always get _one_ of these (on both the /60
    and the /90) during boot:
    [     1.048131] WARNING: lwp 30 (system rt_timer) flags 0x20000000: timecounter went backwards from (1 + 0x3462e4d1a64b88fb/2^64) sec to (1 + 0x0cd08919ef941f4f/2^64) sec in netbsd:mi_switch+0x4d
    But I didn't see a similar message ever again, neither while the
    system is idle, nor while being CPU-loaded, nor with lots of I/O.
    There are about 2850 commits between Jun 6th and "today"; I don't
    have a clue whether or not bisecting it down would be helpful at
    all, or if it's just a red herring...  Thinking about it, I did a
    `git blame` and found:

	36a17127078db (ad        2007-10-08 20:06:17 +0000  505) void
	949e16d902d16 (yamt      2007-12-22 01:14:53 +0000  506) updatertime(lwp_t *l, const struct bintime *now)
	f03010953f572 (yamt      2007-05-17 14:51:11 +0000  507) {
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  508)        static bool backwards = false;
	f03010953f572 (yamt      2007-05-17 14:51:11 +0000  509) 
	f70325ee02948 (rmind     2009-03-28 21:43:16 +0000  510)        if (__predict_false(l->l_flag & LW_IDLE))
	f03010953f572 (yamt      2007-05-17 14:51:11 +0000  511)                return;
	f03010953f572 (yamt      2007-05-17 14:51:11 +0000  512) 
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  513)        if (__predict_false(bintimecmp(now, &l->l_stime, <)) && !backwards) {
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  514)                char caller[128];
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  515) 
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  516) #ifdef DDB
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  517)                db_symstr(caller, sizeof(caller),
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  518)                    (db_expr_t)(intptr_t)__builtin_return_address(0),
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  519)                    DB_STGY_PROC);
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  520) #else
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  521)                snprintf(caller, sizeof(caller), "%p",
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  522)                    __builtin_return_address(0));
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  523) #endif
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  524)                backwards = true;
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  525)                printf("WARNING: lwp %ld (%s%s%s) flags 0x%x:"
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  526)                    " timecounter went backwards"
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  527)                    " from (%jd + 0x%016"PRIx64"/2^64) sec"
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  528)                    " to (%jd + 0x%016"PRIx64"/2^64) sec"
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  529)                    " in %s\n",
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  530)                    (long)l->l_lid,
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  531)                    l->l_proc->p_comm,
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  532)                    l->l_name ? " " : "",
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  533)                    l->l_name ? l->l_name : "",
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  534)                    l->l_pflag,
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  535)                    (intmax_t)l->l_stime.sec, l->l_stime.frac,
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  536)                    (intmax_t)now->sec, now->frac,
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  537)                    caller);
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  538)        }
	589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000  539) 
	949e16d902d16 (yamt      2007-12-22 01:14:53 +0000  540)        /* rtime += now - stime */
	949e16d902d16 (yamt      2007-12-22 01:14:53 +0000  541)        bintime_add(&l->l_rtime, now);
	949e16d902d16 (yamt      2007-12-22 01:14:53 +0000  542)        bintime_sub(&l->l_rtime, &l->l_stime);
	f03010953f572 (yamt      2007-05-17 14:51:11 +0000  543) }

    Argh... So it's probably just that we now _see_ that something
    went backwards---we just didn't get informed about it
    previously...
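
To get a feeling for the size of the step in that warning, the two
(sec + frac/2^64) values can be converted to plain seconds. This is
only a small sketch for reading the message, not anything taken from
the kernel sources; the helper name bintime_to_seconds is made up for
illustration, and the two hex fractions are copied from the boot
message quoted above.

```python
from fractions import Fraction


def bintime_to_seconds(sec: int, frac_hex: str) -> Fraction:
    """Convert a (sec, frac) pair as printed in the warning to exact seconds.

    The warning prints the fraction as a 64-bit hex numerator over 2^64.
    """
    return sec + Fraction(int(frac_hex, 16), 2**64)


# Values copied from the "timecounter went backwards" line quoted above.
before = bintime_to_seconds(1, "0x3462e4d1a64b88fb")
after = bintime_to_seconds(1, "0x0cd08919ef941f4f")

print(f"from {float(before):.6f} s")
print(f"to   {float(after):.6f} s")
print(f"step {float(after - before) * 1000:.3f} ms")
```

With these two values the step comes out at roughly 0.15 s backwards,
which is far more than the 2 to 4 seconds per day of drift would explain
within the first second after boot.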


It seems I'm unable to reproduce the timekeeping issues, at least not
with a non-networked system. I'll bring one of the two systems
downstairs, put it on a wired network, and start ntpdate / ntpd. I'm
highly interested in other people's reports about their setups!
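
To make such reports comparable with what ntpd will show later (it
expresses its frequency correction in ppm), the loss of 2 to 4 seconds
per day measured above can be converted like this. A minimal sketch;
the function name drift_ppm is just an illustrative choice, and it
only looks at the magnitude of the drift, not its sign:

```python
SECONDS_PER_DAY = 86_400


def drift_ppm(seconds_lost_per_day: float) -> float:
    """Express a clock loss in seconds per day as parts per million."""
    return seconds_lost_per_day / SECONDS_PER_DAY * 1_000_000


# The range observed on the 4000/60 and the 4000/90.
for loss in (2, 4):
    print(f"{loss} s/day = {drift_ppm(loss):.1f} ppm")
```

That puts both machines at roughly 23 to 46 ppm.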

  Along with other people's impressions, I really think we should
publicly collect these individual facts so that others don't need to
test the very same setup.

MfG, JBG

[1] https://wiki.netbsd.org/
