Subject: Re: netbsd-1-6 (@2003/08/10) frozen solid on 2-CPU AS4000, sitting at DDB now
To: Michael L. Hitch <mhitch@lightning.msu.montana.edu>
From: Greg A. Woods <woods@weird.com>
List: port-alpha
Date: 11/01/2003 13:51:26
[ On Friday, August 29, 2003 at 17:54:46 (-0600), Michael L. Hitch wrote: ]
> Subject: Re: netbsd-1-6 (@2003/08/10) frozen solid on 2-CPU AS4000, sitting at DDB now
>
>   The problem you have appears to be the MP alpha hang problem that a
> number of other people have experienced.  The PR for that problem hasn't
> been submitted yet.

I suppose I should turn one of these messages into a PR then....

> On Fri, 29 Aug 2003, Greg A. Woods wrote:
> > Here's the latest trace and indeed sched_lock is "1" as predicted:
> ...
> > db{0}> trace
> > cpu_Debugger() at cpu_Debugger+0x4
> > panic() at panic+0x168
> > console_restart() at console_restart+0x74
> > XentRestart() at XentRestart+0x90
> > --- console restart (from ipl 4) ---
> > _simple_lock() at _simple_lock+0x15c
> > wakeup() at wakeup+0xcc
> > schedcpu() at schedcpu+0x34c
> > softclock() at softclock+0x2b4
> > hardclock() at hardclock+0x7c0
> > interrupt() at interrupt+0x180
> > XentInt() at XentInt+0x1c
> > --- interrupt (from ipl 0) ---
> > _simple_unlock() at _simple_unlock+0x208
> > pool_put() at pool_put+0x6c
> > pmap_pv_remove() at pmap_pv_remove+0x10c
> > pmap_remove_mapping() at pmap_remove_mapping+0x404
> > pmap_do_remove() at pmap_do_remove+0x3f8
> > pmap_remove() at pmap_remove+0x78
> 
>    The pmap routines suggest that the kernel pmap will also be locked by
> the cpu - and my guess is that the other cpu is trying to obtain a lock on
> the kernel pmap.
> 
> > --- user mode ---
> > db{0}> examine/d sched_lock
> > sched_lock:     1
> 
>   With LOCKDEBUG, I think you can also do:
> 
> db{0}> call simple_lock_dump()
> 
>   This should display all the simple locks held.  One should be sched_lock
> and probably be held by cpu 1.  The other would probably be the kernel
> pmap lock held by cpu 0.
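
The scenario guessed at above can be written down as a tiny wait-for
graph. This is only a sketch in Python, not kernel code: the CPU and lock
names are just labels, the lock cpu 1 is waiting for is a guess (nothing
in the traces shows it), and find_cycle() is a made-up helper name.

```python
# Who holds what, per the guess above (labels only, not kernel symbols).
holds = {"cpu0": {"kernel_pmap_lock"}, "cpu1": {"sched_lock"}}
# What each CPU is spinning on. cpu0's wait matches the trace
# (wakeup -> _simple_lock); cpu1's wait is an assumption.
wants = {"cpu0": "sched_lock", "cpu1": "kernel_pmap_lock"}

def find_cycle(holds, wants):
    """Return a list of CPUs forming a wait-for cycle, or None."""
    # Map each held lock back to the CPU that owns it.
    owner = {lock: cpu for cpu, locks in holds.items() for lock in locks}
    for start in wants:
        seen = []
        cpu = start
        # Follow "waits for the owner of" edges until they end or repeat.
        while cpu is not None and cpu not in seen:
            seen.append(cpu)
            cpu = owner.get(wants.get(cpu))
        if cpu is not None:
            return seen[seen.index(cpu):]
    return None

print(find_cycle(holds, wants))
```

If the guess about cpu 1 is right, the two CPUs wait on each other and
neither can make progress, which would match a solid freeze.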

Here's a new one, this time with the above simple_lock_dump() output:

RCM>status

Firmware Rev: V1.1
Escape Sequence: ^]^]RCM
Remote Access: DISABLE
Alerts: DISABLE
Alert Pending: NO
Temp (C): 34.0
RCM Power Control: ON
External Power: OFF
Server Power: ON
RCM>halt

Focus returned to COM port

halted CPU 0
CPU 1 is not halted

halt code = 1
operator initiated halt
PC = fffffc000044f544
P00>>>cont

continuing CPU 0
CP - RESTORE_TERM routine to be called
panic: user requested console halt
Stopped in pid 8809 (sh) at     cpu_Debugger+0x4:       ret     zero,(ra)
db{0}> trace
cpu_Debugger() at cpu_Debugger+0x4
panic() at panic+0x168
console_restart() at console_restart+0x74
XentRestart() at XentRestart+0x90
--- console restart (from ipl 4) ---
_simple_lock() at _simple_lock+0x164
wakeup() at wakeup+0xcc
schedcpu() at schedcpu+0x34c
softclock() at softclock+0x2b4
hardclock() at hardclock+0x7c0
interrupt() at interrupt+0x180
XentInt() at XentInt+0x1c
--- interrupt (from ipl 0) ---
_simple_unlock() at _simple_unlock+0x208
pmap_pv_enter() at pmap_pv_enter+0x12c
pmap_enter() at pmap_enter+0x9d4
uvm_fault() at uvm_fault+0x2020
uvm_fault_wire() at uvm_fault_wire+0x74
uvm_fork() at uvm_fork+0x98
fork1() at fork1+0x548
sys_fork() at sys_fork+0x38
syscall_plain() at syscall_plain+0x164
XentSys() at XentSys+0x5c
--- syscall (2) ---
--- user mode ---
db{0}> call simple_lock_dump()
all simple locks:
0xfffffc00007003f0 CPU 0 /usr/src/sys/uvm/uvm_fault.c:876
0xfffffc00007e8718 CPU 0 /usr/src/sys/arch/alpha/alpha/pmap.c:1674
0xfffffc00006c04f8 CPU 1 /usr/src/sys/kern/kern_synch.c:601
       0x7
db{0}> 
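
To make the dump easier to read, here is a small Python sketch that groups
the lines by CPU. It assumes each line is "lock address, holding CPU,
file:line where the lock was taken", which is how I read the output above;
locks_by_cpu() is a hypothetical helper, not anything in the tree.

```python
import re
from collections import defaultdict

# The three lock lines from the simple_lock_dump() output above.
DUMP = """\
0xfffffc00007003f0 CPU 0 /usr/src/sys/uvm/uvm_fault.c:876
0xfffffc00007e8718 CPU 0 /usr/src/sys/arch/alpha/alpha/pmap.c:1674
0xfffffc00006c04f8 CPU 1 /usr/src/sys/kern/kern_synch.c:601
"""

# Assumed line layout: <address> CPU <n> <file>:<line>
LINE = re.compile(r"^(0x[0-9a-f]+) CPU (\d+) (\S+):(\d+)$")

def locks_by_cpu(text):
    """Group (address, file, line) tuples by the CPU number on each line."""
    held = defaultdict(list)
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:  # skip anything that is not a lock line, e.g. the header
            addr, cpu, path, lineno = m.groups()
            held[int(cpu)].append((addr, path, int(lineno)))
    return dict(held)

for cpu, locks in sorted(locks_by_cpu(DUMP).items()):
    for addr, path, lineno in locks:
        print(f"cpu {cpu}: {addr} taken at {path}:{lineno}")
```

Read that way, CPU 0 holds two locks (taken in uvm_fault.c and pmap.c) and
CPU 1 holds one (taken in kern_synch.c).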

Is it odd that this time it's an unlock call at the top of the stack, or
is that just a coincidence due to which CPU got stuck this time?

I can try getting kgdb working on the console if that would help anyone
debug this further.....

Back to a non-MP kernel I guess....   :-(

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>