Hi!
As there is an amazing amount of work being done on and for VAX these
days, I'd like to collect known issues / TODOs and debugging details.
Maybe we could add a page to the NetBSD wiki[1]? (How do people get edit
access?) Right now, I see these major topics:
* Document swap / ulimit requirements to have a successful local
build, for some usual real VAX systems as well as a 512 MB
equipped SIMH VAX.
* Get GCC 12 up'n'running. (Untested, maybe Kalvis already has some
patches. Whatever we find should be upstreamed! That's true also
for the Binutils bits. I think VAX's native 64-bit support hasn't
arrived yet?)
* Get pkgsrc's current Python up'n'running. (VAX FP got removed,
needs to be added back and maybe a maintainer needs to step up?)
* Fix timekeeping issues.
Esp. for the timekeeping issues, I've been testing a lot with a
4000/90 (I falsely claimed this system to be a /96, but that was
wrong---my /96 has a dead Dallas clock chip and is waiting for a
repair) and a 4000/60.
My findings so far are that both systems, booted with a GENERIC
kernel, behave quite the same:
* Both ntpd/ntpdate are disabled.
* No networking.
* Booting off local emulated SCSI disks (PiSCSI), installed locally
from PiSCSI-emulated install ISOs.
* Both systems lose about 2 to 4 seconds per day.
* This loss does not change, regardless of whether
* the system is idle; or
* the system is CPU-loaded (running GCC in a loop); or
* the system has I/O load (`cat`ting all regular files
to /dev/null in a loop, with the FS being on the
PiSCSI-emulated disk.)
* So both the 4000/60 and the /90 keep reasonably stable time.
* Booted with a slightly older image (Jun 6th), I see no unusual
timekeeping-related messages; booting with a more recent image
(g:33d45195d8dbc05843af2d76d66a83970b802c30, Fri Dec 22 17:55:49
2023 +0000), I seem to always get _one_ of these (on both the /60
and the /90) during boot:
[ 1.048131] WARNING: lwp 30 (system rt_timer) flags 0x20000000: timecounter went backwards from (1 + 0x3462e4d1a64b88fb/2^64) sec to (1 + 0x0cd08919ef941f4f/2^64) sec in netbsd:mi_switch+0x4d
But I didn't see a similar message ever again, not while the system
is idle, and also not while being CPU-loaded, nor with lots of I/O.
There are about 2850 commits between Jun 6th and "today". I don't
have a clue whether or not bisecting it down would be helpful at
all, or if it's just a red herring... Thinking about it, I did a
`git blame` and found:
36a17127078db (ad 2007-10-08 20:06:17 +0000 505) void
949e16d902d16 (yamt 2007-12-22 01:14:53 +0000 506) updatertime(lwp_t *l, const struct bintime *now)
f03010953f572 (yamt 2007-05-17 14:51:11 +0000 507) {
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 508) static bool backwards = false;
f03010953f572 (yamt 2007-05-17 14:51:11 +0000 509)
f70325ee02948 (rmind 2009-03-28 21:43:16 +0000 510) if (__predict_false(l->l_flag & LW_IDLE))
f03010953f572 (yamt 2007-05-17 14:51:11 +0000 511) return;
f03010953f572 (yamt 2007-05-17 14:51:11 +0000 512)
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 513) if (__predict_false(bintimecmp(now, &l->l_stime, <)) && !backwards) {
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 514) char caller[128];
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 515)
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 516) #ifdef DDB
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 517) db_symstr(caller, sizeof(caller),
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 518) (db_expr_t)(intptr_t)__builtin_return_address(0),
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 519) DB_STGY_PROC);
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 520) #else
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 521) snprintf(caller, sizeof(caller), "%p",
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 522) __builtin_return_address(0));
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 523) #endif
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 524) backwards = true;
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 525) printf("WARNING: lwp %ld (%s%s%s) flags 0x%x:"
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 526) " timecounter went backwards"
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 527) " from (%jd + 0x%016"PRIx64"/2^64) sec"
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 528) " to (%jd + 0x%016"PRIx64"/2^64) sec"
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 529) " in %s\n",
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 530) (long)l->l_lid,
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 531) l->l_proc->p_comm,
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 532) l->l_name ? " " : "",
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 533) l->l_name ? l->l_name : "",
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 534) l->l_pflag,
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 535) (intmax_t)l->l_stime.sec, l->l_stime.frac,
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 536) (intmax_t)now->sec, now->frac,
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 537) caller);
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 538) }
589aa4ee5837b (riastradh 2023-07-13 13:33:55 +0000 539)
949e16d902d16 (yamt 2007-12-22 01:14:53 +0000 540) /* rtime += now - stime */
949e16d902d16 (yamt 2007-12-22 01:14:53 +0000 541) bintime_add(&l->l_rtime, now);
949e16d902d16 (yamt 2007-12-22 01:14:53 +0000 542) bintime_sub(&l->l_rtime, &l->l_stime);
f03010953f572 (yamt 2007-05-17 14:51:11 +0000 543) }
Argh... So it's probably just that we now _see_ that something
went backwards---we just didn't get informed about it
previously...
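To put numbers on the warning above, here is a minimal userland sketch,
not the kernel code: `struct bt`, `bt_to_double()`, `bt_before()` and
`check_backwards()` are made-up stand-ins for the kernel's
`struct bintime` and the check in `updatertime()`. It assumes the
fraction is simply a count of 2^-64 second units, as the message format
suggests, and models the `static bool backwards` latch from the blame
output.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical userland stand-in for the kernel's struct bintime. */
struct bt {
	int64_t sec;	/* whole seconds */
	uint64_t frac;	/* fraction of a second, in units of 2^-64 s */
};

/* Convert to seconds as a double; the rounding is fine for eyeballing. */
double
bt_to_double(const struct bt *b)
{
	/* 18446744073709551616.0 is 2^64, exactly representable. */
	return (double)b->sec + (double)b->frac / 18446744073709551616.0;
}

/* True if a is strictly earlier than b. */
bool
bt_before(const struct bt *a, const struct bt *b)
{
	return a->sec < b->sec || (a->sec == b->sec && a->frac < b->frac);
}

/*
 * Model of the one-shot check: report the first time `now` lies before
 * `stime`, then stay silent for the rest of the run.
 * Returns true if a warning was printed.
 */
bool
check_backwards(const struct bt *stime, const struct bt *now)
{
	static bool backwards = false;	/* latch, as in the blame output */

	if (!bt_before(now, stime) || backwards)
		return false;
	backwards = true;
	printf("went backwards by %.6f s\n",
	    bt_to_double(stime) - bt_to_double(now));
	return true;
}
```

Fed with the two values from the warning (sec 1 with fractions
0x3462e4d1a64b88fb and 0x0cd08919ef941f4f), the step comes out at
roughly 0.15 s. A second call with the same arguments stays silent,
which would match seeing the message only once per boot: the latch
suppresses any later occurrence, so a single warning does not prove it
happened only once.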
It seems I'm unable to reproduce the timekeeping issues, at least not
with a non-networked system. I'll bring one of the two systems
downstairs and put it on wired network and start ntpdate / ntpd. I'm
highly interested in other people's statements about their setups!
Along with other people's impressions, I really think we should
publicly collect these individual facts so that others don't need to
test the very same setup.
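For comparing the 2 to 4 seconds per day measured above with what ntpd
reports once it is running (its drift value is a frequency offset in
ppm), a small conversion sketch may help when collecting results. The
function name `drift_ppm()` is just an illustration, not an existing
API.

```c
#include <stdio.h>

/*
 * Convert a clock error given in seconds per day into parts per
 * million. One day has 86400 seconds.
 */
double
drift_ppm(double seconds_per_day)
{
	return seconds_per_day / 86400.0 * 1e6;
}

/* Print a measured loss range in both units, e.g. for a wiki table. */
void
print_drift_range(const char *machine, double lo_s_per_day,
    double hi_s_per_day)
{
	printf("%s: %.1f to %.1f s/day = %.1f to %.1f ppm\n", machine,
	    lo_s_per_day, hi_s_per_day,
	    drift_ppm(lo_s_per_day), drift_ppm(hi_s_per_day));
}
```

With the numbers above, `drift_ppm(2.0)` is about 23 ppm and
`drift_ppm(4.0)` about 46 ppm. As far as I know, ntpd can correct
frequency errors up to about 500 ppm, so a drift of this size should be
well within what it can discipline.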
MfG, JBG
[1] https://wiki.netbsd.org/