Subject: "panic: chgproccnt: lost user" on dual-CPU SS20 with 1.6.1_STABLE
To: NetBSD/sparc Discussion List <port-sparc@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: port-sparc
Date: 07/26/2003 15:37:41

I have a dual-CPU SS20 running a 1.6.1_STABLE GENERIC kernel built from
CVS sources (netbsd-1-6 branch) as of a couple of days ago.  Note that
the second CPU isn't doing anything.

It's dying from a "panic: chgproccnt: lost user", something I've never
actually seen before.

The first time I saw this was when I tried a GENERIC.MP kernel, which
spun up the second CPU (but of course didn't schedule any processes on
it).  So I went back to a GENERIC kernel thinking it was just a bug in
the half-baked MULTIPROCESSOR support.  However GENERIC died of the same
panic at the same point last night.

Now my kernel source does of course include some custom modifications
(one of those mods is demonstrated in the console messages below), but
so far as I can tell at first glance they are not involved in any
code path that could trigger this panic, and that seems to be confirmed
by the fact the single-CPU system has never crashed (well, not this way :-).

I've also got a single-CPU SS20 clone running a very similar kernel that
runs fine (it's not quite identical though as it is tuned to the other
system's hardware and was built from somewhat older sources, but it does
have exactly the same local customizations).

I've no doubt the crash is somewhat repeatable given it happened both
times during the run of /etc/daily (and maybe "calendar -a" in
particular).  Note though that as far as I can remember it hasn't
happened every night.

Unfortunately the kernel also always hangs when syncing and never
produces a crash dump.

I'll be building a new kernel this afternoon with DDB and trying to
reproduce the crash on demand, but I thought I'd post this much right
away and see if anyone (who happens to be reading mail today) has any
ideas or more specific directions to point me in.  I'll also instrument
the panic message a bit so that it gives some more pertinent information.
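
For readers unfamiliar with this panic, here is a rough user-space model of
the per-uid process accounting that chgproccnt() performs, with the "lost
user" report extended to show the uid and the delta. It is only a sketch
under assumptions: it follows the usual BSD design (a hash of per-uid
records, each holding a process count), and the names model_chgproccnt,
UIHASH_SIZE, and the CHG_* return codes are hypothetical. It is not the
kernel code, and the netbsd-1-6 sources may differ in detail.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical user-space model, not kernel code. */
#define UIHASH_SIZE 16

struct uidinfo {
    struct uidinfo *next;   /* hash chain */
    unsigned int uid;       /* user the count belongs to */
    long proccnt;           /* processes currently charged to uid */
};

static struct uidinfo *uihash[UIHASH_SIZE];

/* Hypothetical error codes; the real kernel would panic instead. */
enum { CHG_LOST_USER = -1, CHG_NEGATIVE = -2, CHG_NOMEM = -3 };

/*
 * Adjust the process count for uid by diff and return the new count.
 * A negative diff for a uid that has no record is the "lost user" case:
 * something is being un-charged from a user who was never charged.
 */
long
model_chgproccnt(unsigned int uid, int diff)
{
    struct uidinfo **pp = &uihash[uid % UIHASH_SIZE];
    struct uidinfo *uip;

    /* Walk the chain, keeping pp pointing at the link to uip. */
    for (; (uip = *pp) != NULL; pp = &uip->next)
        if (uip->uid == uid)
            break;

    if (uip != NULL) {
        if (uip->proccnt + diff < 0) {
            fprintf(stderr, "chgproccnt: procs < 0: uid %u, count %ld, diff %d\n",
                uid, uip->proccnt, diff);
            return CHG_NEGATIVE;
        }
        uip->proccnt += diff;
        if (uip->proccnt > 0)
            return uip->proccnt;
        /* Count reached zero: drop the record. */
        *pp = uip->next;
        free(uip);
        return 0;
    }

    if (diff <= 0) {
        if (diff == 0)
            return 0;
        /* Instrumented report: say which uid and what delta. */
        fprintf(stderr, "chgproccnt: lost user: uid %u, diff %d\n", uid, diff);
        return CHG_LOST_USER;
    }

    /* First process for this uid: create its record. */
    uip = malloc(sizeof(*uip));
    if (uip == NULL)
        return CHG_NOMEM;
    uip->uid = uid;
    uip->proccnt = diff;
    uip->next = uihash[uid % UIHASH_SIZE];
    uihash[uid % UIHASH_SIZE] = uip;
    return diff;
}
```

In this model, calling model_chgproccnt(1, -1) when only uid 0 has been
charged returns CHG_LOST_USER and reports the uid and delta, which is the
kind of extra detail an instrumented panic message could carry.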

The timestamps in brackets are from conserver (suggesting that the crash
happens very near to the time "calendar -a" runs):

[Sat Jul 26 03:15:07 2003]Jul 26 03:15:07 almost /netbsd: calendar: pid 666 [eid 0:31, rid 0:0, svid: 0:0] called seteuid(1) as superuser, setting svid and ruid to 1
[Sat Jul 26 03:15:08 2003]panic: chgproccnt: lost user
syncing disks... [halt sent]                          [Sat Jul 26 03:15:08 2003]
stopping on keyboard abort
Type  'go' to resume
<#0> ok 

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>