Subject: another (alpha 1.6.x) PMAP SMP simple_lock deadlock -- this time with networking?
To: NetBSD/alpha Discussion List <port-alpha@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: tech-smp
Date: 06/03/2005 03:24:26
For some annoying reason the full system lockups we've been suffering on
the 4-CPU ES40 AlphaServer I recently put into production with 1.6.x
went from once every 8-10 days (which was sufferable but painful) to
once per day, now three days in a row (which is damn near intollerable).

The only good thing about it has been that I've been able to directly
manage the reboot a couple of times, instead of the on-call guys just
hitting the reset button, and so I've finally got some info from DDB.
(It's also good to finally get reasonably strong evidence that it's
"just" a software bug, and not some hardware issue! :-)

There are three events here, the first is a bit different, but even the
last two seem subtly different (one being in the TCP path, the other in
UDP).

However all three are quite different than other PMAP-related
simple_lock bugs I've seen in the past (and which seem to have been
fixed by various patches that have been posted).

I suppose I should post this as a PR too, but I'm wondering if this is
enough information to give someone who's worked with these kinds of bugs
before a clue as to where to start looking, or whether there's anything
more I can/should collect from DDB.

(Note that a full core dump, while theoretically possible in some cases,
as I now have an 18GB dump device attached and only 16GB of RAM, is
likely impossible in this case as all interrupts seem to be disabled at
the point of the hang and so even a "sync" or "reboot" from DDB fails to
get anywhere and the machine must be halted again and then booted from
scratch from SRM.  Maybe I can halt the other CPUs and then try to get
interrupts re-enabled in the kernel from DDB somehow?  Rumour has it
that the Firewire-enabled-KGDB might work on Alpha, even in this state
since DMA should still work, but hopefully that's not necessary.)

================================================================


RMC>halt in

Returning to COM port

halted CPU 0
CPU 1 is not halted
CPU 2 is not halted
CPU 3 is not halted

halt code = 1
operator initiated halt
PC = fffffc00003bd63c
P00>>>
RMC>halt out

Returning to COM port

P00>>>continue

continuing CPU 0
CP - RESTORE_TERM routine to be called
panic: user requested console halt
Stopped in pid 13 (ioflush) at  cpu_Debugger+0x4:       ret     zero,(ra)
db{0}> trace
cpu_Debugger() at cpu_Debugger+0x4
panic() at panic+0x160005]
console_restart() at console_restart+0x7c
XentRestart() at XentRestart+0x90
--- console restart (from ipl 6) ---
_simple_lock() at _simple_lock+0x15c
selwakeup() at selwakeup+0xc0
sowakeup() at sowakeup+0x2c
udp4_sendup() at udp4_sendup+0x160
udp4_realinput() at udp4_realinput+0x2f8
udp_input() at udp_input+0x37c
ip_input() at ip_input+0xd28
ipintr() at ipintr+0xb405]
netintr() at netintr+0xa0]
softintr_dispatch() at softintr_dispatch+0x15c
esigcode() at esigcode+0x78
prologue botch: displacement 216
--- root of call graph ---
db{0}> call simple_lock_dump
all simple locks:
0xfffffc00c9109610 CPU 0 /building/work/woods/m-NetBSD-1.6/sys/miscfs/genfs/genfs_vnops.c:1146
0xfffffc0006fe8b40 CPU 0 /building/work/woods/m-NetBSD-1.6/sys/arch/alpha/alpha/pmap.c:2532
0xfffffc00d7ae8af8 CPU 0 /building/work/woods/m-NetBSD-1.6/sys/arch/alpha/alpha/pmap.c:2810
0xfffffc000059c290 CPU 2 /building/work/woods/m-NetBSD-1.6/sys/kern/kern_synch.c:601
       0x71 
db{0}> machine cpu
addr            dev     id      flags   ipis    curproc         fpcurproc
0xfffffc0000750088      cpu0    0       5       0       0xfffffc00870d42e8      0xfffffc
03e5f5b4281 
0xfffffe0000047c00      cpu3    3       4       80      0x0     0xfffffc01564d6330
0xfffffe0000047d00      cpu2    2       4       80      0x0     0xfffffc00e50fba18
0xfffffe0000047e00      cpu1    1       4       80      0x0     0xfffffc00c9acc8b8
db{0}> sync 
syncing disks... 
RMC>halt in

Returning to COM port
CP - RESTORE_TERM exited with hlt_req = 0, r0 = 00000007.00000000

halted CPU 0
CPU 1 is not halted
CPU 2 is not halted
CPU 3 is not halted

halt code = 0
PC = fffffc00003bd63c
P00>>>
RMC>halt out

Returning to COM port

P00>>>cont
Slot context is not valid
P00>>>boot
resetting all I/O buses


================================================================


RMC>halt in

Returning to COM port

halted CPU 0
CPU 1 is not halted
CPU 2 is not halted
CPU 3 is not halted

halt code = 1
operator initiated halt
PC = fffffc00003bd63c
P00>>>
RMC>halt out

Returning to COM port

P00>>>cont

continuing CPU 0
CP - RESTORE_TERM routine to be called
panic: user requested console halt
Stopped in pid 17675 (lmtpd) at cpu_Debugger+0x4:       ret     zero,(ra)
db{0}> where
No such command
db{0}> trace
cpu_Debugger() at cpu_Debugger+0x4
panic() at panic+0x160
console_restart() at console_restart+0x7c
XentRestart() at XentRestart+0x90
--- console restart (from ipl 6) ---
_simple_lock() at _simple_lock+0x15c
wakeup() at wakeup+0x60
sowakeup() at sowakeup+0x64
tcp_input() at tcp_input+0x177c
ip_input() at ip_input+0xd28
ipintr() at ipintr+0xb4
netintr() at netintr+0xa0
softintr_dispatch() at softintr_dispatch+0x15c
esigcode() at esigcode+0x78
prologue botch: displacement 216
--- root of call graph ---
db{0}> call simple_lock_dump
all simple locks:
0xfffffc00047429c0 CPU 0 /building/work/woods/m-NetBSD-1.6/sys/arch/alpha/alpha/pmap.c:2931
0xfffffc00f5c72538 CPU 0 /building/work/woods/m-NetBSD-1.6/sys/arch/alpha/alpha/pmap.c:2810
0xfffffc000059c290 CPU 2 /building/work/woods/m-NetBSD-1.6/sys/kern/kern_synch.c:601
       0x7
db{0}> machine cpu
addr            dev     id      flags   ipis    curproc         fpcurproc
0xfffffc0000750088      cpu0    0       5       0       0xfffffc00e28a3428      0xfffffc
00e28a3428
0xfffffe0000047c00      cpu3    3       4       80      0x0     0x0
0xfffffe0000047d00      cpu2    2       4       80      0x0     0xfffffc00870d5148
0xfffffe0000047e00      cpu1    1       4       80      0x0     0xfffffc00f8756bd8
db{0}> sync
syncing disks... 
RMC>halt in

Returning to COM port
CP - RESTORE_TERM exited with hlt_req = 0, r0 = 00000007.00000000

halted CPU 0
CPU 1 is not halted
CPU 2 is not halted
CPU 3 is not halted

halt code = 0
PC = fffffc00003bd63c
P00>>>
RMC> halt out

Returning to COM port

P00>>>cont
Slot context is not valid
P00>>>boot
resetting all I/O buses


================================================================

RMC>halt in

Returning to COM port

halted CPU 0
CPU 1 is not halted
CPU 2 is not halted
CPU 3 is not halted

halt code = 1
operator initiated halt
PC = fffffc00003bd644
P00>>>cont

continuing CPU 0
CP - RESTORE_TERM routine to be called
panic: user requested console halt
Stopped in pid 8570 (pop3d) at  cpu_Debugger+0x4:       ret     zero,(ra)
db{0}> 
RMC>

RMC>halt out

Returning to COM port

db{0}> trace
cpu_Debugger() at cpu_Debugger+0x4
panic() at panic+0x160
console_restart() at console_restart+0x7c
XentRestart() at XentRestart+0x90
--- console restart (from ipl 6) ---
_simple_lock() at _simple_lock+0x164
selwakeup() at selwakeup+0xc0
sowakeup() at sowakeup+0x2c
udp4_sendup() at udp4_sendup+0x160
udp4_realinput() at udp4_realinput+0x2f8
udp_input() at udp_input+0x37c
ip_input() at ip_input+0xd28
ipintr() at ipintr+0xb4
netintr() at netintr+0xa0
softintr_dispatch() at softintr_dispatch+0x15c
esigcode() at esigcode+0x78
prologue botch: displacement 216
--- root of call graph ---
db{0}> call simple_lock_dump()
all simple locks:
0xfffffc00044fe2e0 CPU 0 /building/work/woods/m-NetBSD-1.6/sys/arch/alpha/alpha/pmap.c:2931
0xfffffc010e5423c8 CPU 0 /building/work/woods/m-NetBSD-1.6/sys/arch/alpha/alpha/pmap.c:2810
0xfffffc000059c290 CPU 3 /building/work/woods/m-NetBSD-1.6/sys/kern/kern_synch.c:601
       0x7
db{0}> 
RMC>

================================================================

-- 
						Greg A. Woods

H:+1 416 218-0098  W:+1 416 489-5852 x122  VE3TCP  RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>