Subject: kern/33234: LOCKDEBUG MP panic in i386 pmap_load from sys_execve
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: None <jld@panix.com>
List: netbsd-bugs
Date: 04/10/2006 22:55:00
>Number:         33234
>Category:       kern
>Synopsis:       MP LOCKDEBUG assertion failure, due to uninitialized pmap during exec
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Apr 10 22:55:00 +0000 2006
>Originator:     Jed Davis
>Release:        NetBSD 3.0
>Organization:
PANIX Public Access Internet and UNIX, NYC
>Environment:
System: NetBSD mail2.panix.com 3.0 NetBSD 3.0 (PANIX-STD-MP-DEBUG) #0: Fri Apr  7 04:35:36 EDT 2006  root@juggler.panix.com:/devel/netbsd/3.0/src/sys/arch/i386/compile/PANIX-STD-MP-DEBUG i386
Architecture: i386
Machine: i386
>Description:

We're trying to track down some MP-related panic and deadlock problems
in 3.0/i386 that weren't an issue under 2.0.x, so this host was
running a kernel with DIAGNOSTIC, DEBUG, and LOCKDEBUG.  Here's the
panic and trace:

panic: kernel debugging assertion "(v == __SIMPLELOCK_LOCKED) || (v == __SIMPLELOCK_UNLOCKED)" failed: file "../../../../arch/x86/x86/lock_machdep.c", line 83
Begin traceback...
__main(c042a890,c0468d20,53,c0468ce0,1) at netbsd:__main
__cpu_simple_lock(cec89cf4,c049b420,1,286,c049b420) at netbsd:__cpu_simple_lock+0xd5
_simple_lock(cec89cf4,c046a3e0,73b,c049b420,cec89cf4) at netbsd:_simple_lock+0x7a
pmap_reference(cec89cf4,c049489c,480,297,282) at netbsd:pmap_reference+0x1a
pmap_load(c0266bb7,cce38000,804a000,480,cec5d1a4) at netbsd:pmap_load+0xc4
copyout(cce38000,480,cec03d14,282,1000) at netbsd:copyout+0xf
ffs_read(cec03cb4,ce7041f4,10001,20001,c03b5360) at netbsd:ffs_read+0x4a6
VOP_READ(ce7041f4,cec03d14,1,ccd20000,0) at netbsd:VOP_READ+0x34
vn_rdwr(0,ce7041f4,804a000,480,1000) at netbsd:vn_rdwr+0xb4
vmcmd_readvn(cedb422c,c227521c,bfc00000,0,0) at netbsd:vmcmd_readvn+0x2f
sys_execve(cec5d1a4,cec03f64,cec03f5c,c04930c4,282) at netbsd:sys_execve+0x620
syscall_plain() at netbsd:syscall_plain+0x1a5

The value at cec89cf4 (the "v" in the assert) is 0xDEADBEEF, as is much
of the rest of the struct pmap that should reside at that location.
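
To illustrate why the assertion trips, here is a minimal userland sketch,
not the code from lock_machdep.c.  It assumes a two-state lock word
(0 = unlocked, 1 = locked); the sketch_* names, the struct layout, and
those numeric values are hypothetical stand-ins chosen for illustration.
The only values taken from this report are the predicate in the
assertion text and the 0xDEADBEEF pattern found at cec89cf4.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed lock-word states; illustrative values, not copied from the kernel. */
#define SKETCH_SIMPLELOCK_UNLOCKED 0u
#define SKETCH_SIMPLELOCK_LOCKED   1u

/* Pattern observed in the lock word (and much of the struct) in this report. */
#define SKETCH_FILL_PATTERN 0xDEADBEEFu

/* Hypothetical stand-in for struct pmap: a lock word plus some other fields. */
struct sketch_pmap {
    uint32_t lock_word;
    uint32_t other_fields[3];
};

/* Same shape as the failed assertion: v must be one of the two valid states. */
bool sketch_lock_word_is_valid(uint32_t v)
{
    return v == SKETCH_SIMPLELOCK_LOCKED || v == SKETCH_SIMPLELOCK_UNLOCKED;
}

/* Fill every word with the pattern, modelling memory that does not
 * currently hold an initialized object. */
void sketch_poison(struct sketch_pmap *pm)
{
    pm->lock_word = SKETCH_FILL_PATTERN;
    for (size_t i = 0; i < sizeof(pm->other_fields) / sizeof(pm->other_fields[0]); i++)
        pm->other_fields[i] = SKETCH_FILL_PATTERN;
}

/* Put the lock word into a valid state, as proper initialization would. */
void sketch_init(struct sketch_pmap *pm)
{
    pm->lock_word = SKETCH_SIMPLELOCK_UNLOCKED;
    for (size_t i = 0; i < sizeof(pm->other_fields) / sizeof(pm->other_fields[0]); i++)
        pm->other_fields[i] = 0;
}
```

In this sketch, sketch_lock_word_is_valid() rejects the lock word of a
struct that has only been through sketch_poison() and accepts it once
sketch_init() has run.  That matches the symptom above: the lock code was
handed a pointer to memory that was not holding an initialized pmap at
that moment.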

We have a core, but it wasn't dumped until after the host panicked again
trying to sync disks (ddb.onpanic was 0); during that second panic, stack
traces from both CPUs were written to the serial console at the same
time, making them partially illegible.

>How-To-Repeat:

Letting this host (a mail relay) run a DIAGNOSTIC/DEBUG/LOCKDEBUG kernel
for a few hours usually makes it panic one way or another; with a regular
MP kernel, it takes longer.

The sys_execve -> vmcmd_readvn -> vn_rdwr -> copyout path seems
particularly troubled.

>Fix:

An apparent workaround: run a uniprocessor kernel.  This costs us a CPU (a
real one, too, not a "hyperthread"), but it improves stability.