Subject: Re: netbsd-1-6 (@2003/08/10) frozen solid on 2-CPU AS4000, sitting
To: NetBSD/alpha Discussion List <port-alpha@NetBSD.org>
From: Michael L. Hitch <mhitch@lightning.msu.montana.edu>
List: port-alpha
Date: 08/14/2003 13:51:27

On Wed, 13 Aug 2003, Greg A. Woods wrote:

> I had done a fair bit of mucking about with my new as4000 since
> upgrading it with a second CPU and some more RAM, but today as I started
> a full rebuild of /usr/xsrc (over NFS) and it froze solid not long after
> it started.  No pings, and no echo on the serial console.

  This looks like the same Alpha SMP hangs I've been seeing on my CS20.

> Now the first big problem is that I can't get it to drop into the
> debugger when I send it a BREAK on the serial console.  However I can of

  Both cpus are probably running with interrupts disabled, so they cannot
respond to serial interrupts.

> 	db{0}> trace
> 	cpu_Debugger() at cpu_Debugger+0x4
> 	panic() at panic+0x168
> 	console_restart() at console_restart+0x74
> 	XentRestart() at XentRestart+0x90
> 	--- console restart (from ipl 4) ---
> 	schedclock() at schedclock+0x88

  This location appears to correspond to the SCHED_LOCK() in schedclock().
If you examine the sched_lock variable, it's probably 1.  I suspect the
other cpu has the sched_lock, and the current cpu is waiting for the other
cpu to release the lock.  The other cpu is probably spinning, waiting for a
lock the current cpu holds (I would suspect the kernel pmap lock).
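
  The suspected cycle can be sketched as a toy model in Python.  This is
not kernel code: the name "pmap_lock", the cpu numbering, and the helper
find_wait_cycle() are made up for illustration, and which cpu holds which
lock is only the guess described above.

```python
def find_wait_cycle(holder, waiting_on):
    """Return the cpus that form a wait-for cycle, or [] if there is none.

    holder:     lock name -> cpu currently holding that lock
    waiting_on: cpu -> name of the lock that cpu is spinning on
    """
    for start in waiting_on:
        chain = []
        cpu = start
        # Follow "cpu waits for the lock held by the next cpu" until the
        # chain ends (nobody is waiting) or loops back on itself.
        while cpu in waiting_on and cpu not in chain:
            chain.append(cpu)
            cpu = holder.get(waiting_on[cpu])
        if cpu in chain:
            return chain[chain.index(cpu):]
    return []


# Assumed state at the hang: cpu1 holds sched_lock, cpu0 holds the
# (hypothetically named) pmap_lock, and each spins on the other's lock.
holder = {"sched_lock": 1, "pmap_lock": 0}
waiting_on = {0: "sched_lock", 1: "pmap_lock"}
print(find_wait_cycle(holder, waiting_on))  # prints [0, 1]
```

  With interrupts disabled on both cpus, neither side of such a cycle can
be interrupted out of its spin loop, which would match the symptoms.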

> Now, what do I do with CPU#1?  There's no "prom" command in ddb so I
> can't jump back to the SRM and halt it, though I suppose I could try
> using RCM to halt CPU#0 again....

  If what I suspect is true, you can't do anything with the other cpu.
DDB has a command to switch to the other cpu, but that will only work if
the other cpu has been suspended.  If the other cpu is spinning on a lock
with interrupts disabled, it can't get the IPI interrupt to tell it to
suspend.  I've tried halting the second cpu before doing the continue, but
that doesn't work any better: that cpu is then stopped and can't suspend.

> BTW, I find it really annoying that a console halt triggers a panic().
> Is there really no way to continue the OS from DDB on alpha?

  I presume that the full state of the cpu needs to be restored on entry
from the SRM continue command.  I can't remember if I've ever been able to
do this with OpenVMS or Tru64.

> Unfortunately since upgrading the RAM I no longer have enough space on
> my current dump partition to leave a system core dump.

  Heh - I had that problem during my debugging.  I finally took the time
to repartition my disk to get enough space for a dump.

> I'll keep it sitting at DDB for an hour or so in case someone has any
> suggestions for gathering further information of use in debugging this
> freeze....

  There's not much more you can easily get.  Displaying the contents of
sched_lock would be one.  The other would be to look at the kernel pmap
lock variable (not quite as easy, since it's not a simple static variable
like sched_lock).

  Running a LOCKDEBUG kernel, built with gdb debugging enabled, and
getting a crash dump is probably the best bet for getting more useful
information.  I've been able to get a few dumps on my CS20, although the
more easily reproduced hangs leave the SCSI driver in an unusable state,
so the kernel can't dump memory to disk.
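
  A kernel config fragment along these lines should produce such a kernel.
The option names are the usual NetBSD ones, but this is an untested sketch:
check it against the config files in your own tree before relying on it.

```
# SMP kernel with lock debugging and the in-kernel debugger
options 	MULTIPROCESSOR
options 	LOCKDEBUG
options 	DDB

# Build with full debug symbols for use with gdb
makeoptions	DEBUG="-g"
```

  With DEBUG="-g" set, the build should also leave a netbsd.gdb image with
symbols next to the normal kernel, for use when examining the dump.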

  LOCKDEBUG keeps track of which locks each cpu holds, and where those
locks were obtained.
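
  As a rough picture of that bookkeeping, here is a small Python model.  It
is not how the kernel implements LOCKDEBUG; the LockTracker class, its
methods, and the "pmap_lock" and "pmap_enter" names are hypothetical and
only show the kind of record (holder cpu plus acquisition site) involved.

```python
class LockTracker:
    """Toy model: remember which cpu holds each lock and where it took it."""

    def __init__(self):
        self._held = {}  # lock name -> (cpu, acquisition site)

    def acquire(self, lock, cpu, where):
        if lock in self._held:
            owner, site = self._held[lock]
            # A real spin lock would spin here; the model reports the
            # holder instead, which is the information wanted in a hang.
            raise RuntimeError(
                f"cpu{cpu} wants {lock} at {where}, "
                f"but cpu{owner} took it at {site}"
            )
        self._held[lock] = (cpu, where)

    def release(self, lock, cpu):
        owner, _ = self._held.get(lock, (None, None))
        if owner != cpu:
            raise RuntimeError(f"cpu{cpu} does not hold {lock}")
        del self._held[lock]

    def held_by(self, cpu):
        """List (lock, site) pairs for every lock this cpu holds."""
        return sorted(
            (lock, site)
            for lock, (owner, site) in self._held.items()
            if owner == cpu
        )


tracker = LockTracker()
tracker.acquire("sched_lock", 1, "schedclock")
tracker.acquire("pmap_lock", 0, "pmap_enter")
print(tracker.held_by(1))  # prints [('sched_lock', 'schedclock')]
```

  A per-cpu list like the one held_by() returns is what makes a crash dump
from such a kernel more useful than the bare lock values.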

> BTW, 'top', which happened to be running at the time showed:

  Top probably won't show much, because its display is a snapshot from some
time before the hang actually occurred.

  A 'ps' from DDB can display some process information at the time of the
hang.  I can't remember at the moment if there's anything in that display
that can be used.  I'll need to go back and look at the console captures
I got from my hangs.


--
Michael L. Hitch			mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University	Bozeman, MT	USA