Subject: Re: Random panics with 2-0 on CS20
To: sgimips NetBSD list <sgimips@mrynet.com>
From: Michael L. Hitch <mhitch@lightning.msu.montana.edu>
List: port-alpha
Date: 05/28/2004 00:14:17
On Fri, 28 May 2004, sgimips NetBSD list wrote:

> panic: fpsave ipi didn't

  I've seen these a few times on my CS20s, but not very often.

> Stopped in pid 20304.1 (sed) at netbsd:cpu_Debugger+0x4:        ret     zero,(ra)
> db{1}> bt
> cpu_Debugger() at netbsd:cpu_Debugger+0x4
> panic() at netbsd:panic+0x1f8
> fpusave_proc() at netbsd:fpusave_proc+0x258
> cpu_lwp_free() at netbsd:cpu_lwp_free+0x38
> exit1() at netbsd:exit1+0x428
> sys_exit() at netbsd:sys_exit+0x48
> syscall_plain() at netbsd:syscall_plain+0xd4
> XentSys() at netbsd:XentSys+0x60
> --- syscall (1) ---
> --- user mode ---
> db{1}> reboot
> syncing disks...
>
> At this point the machine just hangs hard with no reboot.
...
> Suggestions?

  The panic results when one CPU is trying to get the other CPU(s) to
synchronize the FP state.  If the other CPU(s) fail to perform that
request after a certain amount of time, the requesting CPU will panic.
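
  As a rough illustration of that request-and-wait pattern, here is a
minimal single-threaded model in C.  It is only a sketch, not the
NetBSD code: struct model_cpu, model_cpu_tick() and request_fp_sync()
are hypothetical names, and the spin limit stands in for whatever
timeout the kernel really uses.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a remote CPU; not a NetBSD data structure. */
struct model_cpu {
	bool ipi_pending;	/* a "save FP state" request has been posted */
	bool intr_blocked;	/* the CPU cannot take the IPI right now */
};

/* One time step of the remote CPU: it only services the request
 * if interrupts are not blocked. */
static void
model_cpu_tick(struct model_cpu *c)
{
	assert(c != 0);
	if (c->ipi_pending && !c->intr_blocked)
		c->ipi_pending = false;
}

/* Post the request and poll for completion.  Returns true if the
 * remote CPU responded within max_spins steps.  A false return is
 * the point where a real kernel would give up and panic. */
static bool
request_fp_sync(struct model_cpu *remote, unsigned max_spins)
{
	remote->ipi_pending = true;
	for (unsigned spin = 0; spin < max_spins; spin++) {
		model_cpu_tick(remote);
		if (!remote->ipi_pending)
			return true;
	}
	return false;
}
```

  With intr_blocked set on the remote CPU, request_fp_sync() runs out
of spins and returns false, which corresponds to the panic above.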

  Looking at the path from the syscall to exit1(), I see one potential
deadlock condition.  It looks like the current CPU will have done a
KERNEL_PROC_LOCK() and doesn't release that lock until near the end of
exit1() - after cpu_lwp_free() returns.

  If the CPU that's supposed to respond to the ipi request has interrupts
blocked and is trying to acquire the KERNEL_PROC_LOCK(), it could hang
there and never get the ipi interrupt.  I'm not familiar enough with
KERNEL_PROC_LOCK() to know if this might happen, but I have seen a
deadlock between PMAP_LOCK() and SCHED_LOCK() hang the system.
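
  The suspected interaction can be modelled as a small step-by-step
simulation.  This is a sketch under the assumptions stated in this
message (CPU A holds the lock and waits for the ipi; CPU B spins on
the same lock), and simulate() and its types are made-up names, not
kernel interfaces.

```c
#include <stdbool.h>

enum outcome { COMPLETED, DEADLOCKED };

/* Simulate CPU A (holds the lock, waits for CPU B to answer an ipi)
 * and CPU B (spins trying to take the same lock).  If CPU B spins
 * with interrupts blocked, it never sees the ipi. */
static enum outcome
simulate(bool b_blocks_interrupts, unsigned max_steps)
{
	int lock_owner = 0;		/* CPU A (0) holds the lock; -1 = free */
	bool ipi_pending = true;	/* CPU A has already posted the request */
	bool a_done = false;
	bool b_done = false;

	for (unsigned step = 0; step < max_steps; step++) {
		/* CPU B: take the ipi only if interrupts are enabled. */
		if (ipi_pending && !b_blocks_interrupts)
			ipi_pending = false;
		/* CPU B: try to acquire the lock. */
		if (!b_done && lock_owner == -1) {
			lock_owner = 1;
			b_done = true;
		}
		/* CPU A: release the lock only after the ipi is answered. */
		if (!a_done && !ipi_pending) {
			lock_owner = -1;
			a_done = true;
		}
		if (a_done && b_done)
			return COMPLETED;
	}
	return DEADLOCKED;	/* neither side could make progress */
}
```

  In this model simulate(false, 100) completes, while
simulate(true, 100) never makes progress: A waits on B's ipi response
and B waits on A's lock.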

  I suspect there may also be a deadlock occurring when the reboot
attempts to clean up and the current CPU will hang at that point.  I've
resorted to doing a reboot 104 or reboot 10c to try to get a dump without
doing a disk sync, but sometimes the scsi driver is not in a state to
perform the dump and it either hangs or aborts.

  Running a kernel compiled with gdb debugging information and getting a
good dump makes it easier to attempt to determine what the system was
doing when it panics or hangs.  It can be done with DDB, but it's a lot
harder to dig through the structures that way.  Also, using LOCKDEBUG can
provide more information in the kernel locking structures.  It will
maintain information on what locks are held by which CPU, which is
sometimes quite useful in determining deadlock problems.
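
  To show why per-lock owner information helps, here is a small C
sketch of the general idea of recording which CPU took a lock and
where.  It is not how LOCKDEBUG is implemented; struct dbg_lock and
the dbg_lock_*() functions are hypothetical names used only for
illustration.

```c
#include <stddef.h>

#define NO_OWNER (-1)

/* Hypothetical lock record carrying debugging information. */
struct dbg_lock {
	const char *name;	/* human-readable lock name */
	int owner_cpu;		/* CPU holding the lock, or NO_OWNER */
	const char *file;	/* where the lock was taken */
	int line;
};

/* Try to take the lock for `cpu`, recording the call site.
 * Returns 1 on success, 0 if another owner is already recorded. */
static int
dbg_lock_try(struct dbg_lock *l, int cpu, const char *file, int line)
{
	if (l->owner_cpu != NO_OWNER)
		return 0;
	l->owner_cpu = cpu;
	l->file = file;
	l->line = line;
	return 1;
}

/* Release the lock and clear the recorded owner. */
static void
dbg_lock_release(struct dbg_lock *l)
{
	l->owner_cpu = NO_OWNER;
	l->file = NULL;
	l->line = 0;
}

/* Count how many of the n locks are currently held by `cpu`. */
static size_t
dbg_locks_held_by(const struct dbg_lock *locks, size_t n, int cpu)
{
	size_t held = 0;
	for (size_t i = 0; i < n; i++)
		if (locks[i].owner_cpu == cpu)
			held++;
	return held;
}
```

  With records like these in a dump, a failed acquire can be traced
back to the CPU and call site that still hold the lock.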

--
Michael L. Hitch			mhitch@montana.edu
Computer Consultant
Information Technology Center
Montana State University	Bozeman, MT	USA