Subject: Re: port-alpha/25599: Alpha SMP system hangs hard
To: None <mhitch@netbsd.org>
From: Greg A. Woods <woods@weird.com>
List: netbsd-bugs
Date: 06/03/2004 19:34:48
[[ I've pointed the reply-to to port-alpha instead of cluttering the PR
with discussion -- but I guess any further details or fixes should be
copied to the PR ]]

[ On Sunday, May 16, 2004 at 13:23:10 (-0600), mhitch@netbsd.org wrote: ]
> Subject: port-alpha/25599: Alpha SMP system hangs hard
> 
> If the pmap locking is not really required in pmap_activate() when it's the
> kernel pmap, the previous patch should be sufficient.

I have had very good luck with Michael's two-line patch on 1.6.2_STABLE
on my dual-CPU AS4000.  I've done most of a full build with sources from
NFS, objects on a local mlx(4) array, using "build.sh -j 4".

Simultaneously the machine runs my caching DNS with bind-8 and a few
xterms for "links" and such, as well as an instance of Mozilla-1.6 with
a mail window open on my main IMAP inbox (i.e. the regular 10-min check
for new mail causing mozilla to run regularly).

That is, it worked great until just a few minutes ago, when I tried
doing some more web browsing with Mozilla while "build.sh -U" was
running again (after it had apparently hit some kind of parallel make
issue that bombed it out for no good reason).

Of course, without the patch an MP kernel would sometimes hang without
any real provocation, even without Mozilla's help, so the initial patch
alone does seem to be a 90-95% solution to the problem (at least when
the machine has enough memory to avoid paging much?).


Here's the console output, complete with a stack backtrace (for whatever
it's worth):


simple_lock: locking against myself
lock: 0xfffffc00007ee938, currently at: /proven/work/woods/m-NetBSD-1.6/sys/arch/alpha/alpha/pmap.c:2140
on cpu 0
last locked: /proven/work/woods/m-NetBSD-1.6/sys/arch/alpha/alpha/pmap.c:2721
last unlocked: /proven/work/woods/m-NetBSD-1.6/sys/arch/alpha/alpha/pmap.c:2734
alpha trace requires known PC =eject=
Stopped in pid 8 (ioflush) at   cpu_Debugger+0x4:       ret     zero,(ra)
db{0}> trace
cpu_Debugger() at cpu_Debugger+0x4
_simple_lock() at _simple_lock+0x128
pmap_extract() at pmap_extract+0x84
uvm_km_pgremove_intrsafe() at uvm_km_pgremove_intrsafe+0x48
uvm_unmap_remove() at uvm_unmap_remove+0x1a8
uvm_unmap() at uvm_unmap+0x174
uvm_km_free() at uvm_km_free+0x34
free() at free+0x254
softdep_disk_write_complete() at softdep_disk_write_complete+0x2f0
biodone() at biodone+0xa0
lddone() at lddone+0xb4
ld_mlx_handler() at ld_mlx_handler+0x114
mlx_intr() at mlx_intr+0xec
alpha_shared_intr_dispatch() at alpha_shared_intr_dispatch+0x6c
kn300_iointr() at kn300_iointr+0x50
interrupt() at interrupt+0x32c
XentInt() at XentInt+0x1c
--- interrupt (from ipl 0) ---
pmap_tlb_shootdown() at pmap_tlb_shootdown+0x248
pmap_changebit() at pmap_changebit+0x148
pmap_clear_modify() at pmap_clear_modify+0xd4
uvn_findpage() at uvn_findpage+0x118
uvn_findpages() at uvn_findpages+0x130
genfs_putpages() at genfs_putpages+0x930
end() at 0xfffffc00017e1b40
prologue botch: displacement 16384
frame size botch: adjust register offsets?
prologue botch: displacement 8192
frame size botch: adjust register offsets?
prologue botch: displacement 16384
frame size botch: adjust register offsets?
prologue botch: displacement 24576
frame size botch: adjust register offsets?
frame size botch: adjust register offsets?
frame size botch: adjust register offsets?
frame size botch: adjust register offsets?
frame size botch: adjust register offsets?
--- root of call graph ---
db{0}> 


A 'sync' gave an almost immediate "panic: lockmgr: locking against
myself" (with a somewhat expected "tlp:0 receive ring overrun"
preceding it), so I didn't even bother trying to write out a dump.
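
For anyone not familiar with the diagnostic, here is a minimal sketch of
how a debugging spin lock can notice "locking against myself": it
records which CPU holds the lock and where it was taken, and complains
if that same CPU tries to take it again.  This is only an illustration
under stated assumptions, not the kernel's actual simple_lock code; the
names dbg_lock, dbg_lock_acquire and dbg_lock_release are hypothetical,
and a real lock would spin on an atomic test-and-set rather than return
a "busy" value.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical debugging lock; NOT the kernel's struct simplelock. */
struct dbg_lock {
	int		locked;	/* 1 while held */
	int		holder;	/* CPU id of the holder, -1 if none */
	const char	*file;	/* where it was last taken */
	int		line;
};

#define DBG_LOCK_INIT	{ 0, -1, NULL, 0 }

/*
 * Returns 0 on success, -1 if the calling CPU already holds the lock
 * (the "locking against myself" case), and 1 if another CPU holds it.
 * A real spin lock would spin in the last case instead of returning.
 */
static int
dbg_lock_acquire(struct dbg_lock *l, int cpu, const char *file, int line)
{
	if (l->locked && l->holder == cpu) {
		printf("dbg_lock: locking against myself\n");
		printf("currently at: %s:%d on cpu %d\n", file, line, cpu);
		printf("last locked: %s:%d\n", l->file, l->line);
		return -1;
	}
	if (l->locked)
		return 1;

	l->locked = 1;
	l->holder = cpu;
	l->file = file;
	l->line = line;
	return 0;
}

static void
dbg_lock_release(struct dbg_lock *l)
{
	assert(l->locked);	/* releasing an unheld lock is a bug */
	l->locked = 0;
	l->holder = -1;
}
```

One way a single CPU can end up in the -1 case is when code holding
such a lock is interrupted and the interrupt handler, running on the
same CPU, then tries to take the same lock.  Since the holder cannot
run again until the handler returns, a non-debugging spin lock would
simply hang there.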

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>