Subject: kern/14251: raidframe LOCKDEBUG problems
To: None <gnats-bugs@gnats.netbsd.org>
From: None <apb@cequrux.com>
List: netbsd-bugs
Date: 10/15/2001 18:04:04
>Number:         14251
>Category:       kern
>Synopsis:       raidframe LOCKDEBUG problems
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Oct 15 09:08:00 PDT 2001
>Closed-Date:
>Last-Modified:
>Originator:     Alan Barrett
>Release:        NetBSD-current 2001-10-13
>Organization:
Not much
>Environment:
System: NetBSD apb 1.5Y NetBSD 1.5Y (APB) #0: Mon Oct 15 14:42:28 SAST 2001 apb@apb:/usr/src/sys/arch/i386/compile/APB i386
Architecture: i386
Machine: i386

Relevant kernel config options are:
pseudo-device   raid            4       # RAIDframe disk driver
options         RAID_AUTOCONFIG         # auto-configuration of RAID components
options 	LOCKDEBUG

>Description:
There seem to be locking bugs in raidframe.  This PR is from an i386
system with NetBSD-current as of a few days ago, and with options
RAID_AUTOCONFIG.  The bugs do not appear to be machine dependent.

When I build a kernel without options LOCKDEBUG, it seems to work fine.

When I build a kernel with options LOCKDEBUG, as soon as it gets to
the raidframe autoconfigure stage it prints several thousand lines of
debug messages (mostly "malloc with held simple_lock", and backtraces
associated with that) but continues to boot regardless.  It eventually
panics with:

panic: kernel diagnostic assertion "obj == 0 || _simple_lock_held((&obj->vmobjlock))" failed: file "../../../../uvm/uvm_page.c", line 990

>How-To-Repeat:
Set up a system with the root filesystem on a RAID 1 partition.
Boot a kernel with options LOCKDEBUG.
Observe the fun.

Here's an extract from what was printed to the console:

> boot netbsd.n -s
booting wd0a:netbsd.n (howto 0x2)
2965408+68728+403744 [65+189264+161742]=0x39e260
[ using 351528 bytes of netbsd ELF symbol table ]
BIOS CFG: Model-SubMod-Rev: fc-01-00, 0x74<EBDA,KBDINT,RTC,IC2>
Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001
    The NetBSD Foundation, Inc.  All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
    The Regents of the University of California.  All rights reserved.

NetBSD 1.5Y (APB) #0: Mon Oct 15 13:35:24 SAST 2001

[... lots deleted ...]

raidattach: Asked for 4 units
Kernelized RAIDframe activated
Searching for raid components...

pool_get(PR_WAITOK) with held simple_lock 0xc03dad78 CPU 0 ../../../../kern/vfs_subr.c:457
pool_get(c04460e0,2,15f,c011bf45,0) at pool_get+0x6f
uvm_map(c03f7f40,c04a1c0c,1000,c03f7e40,f5b8000,0,0,1727) at uvm_map+0x866
uvm_km_kmemalloc(c03f7f40,c03f7e40,1000,1,0) at uvm_km_kmemalloc+0x6d
uvm_km_alloc_poolpage1(c03f7f40,c03f7e40,0,c02c08d9,0) at uvm_km_alloc_poolpage1+0x3b
pool_page_alloc_nointr(1000,0,19,c01a25de,c0431950) at pool_page_alloc_nointr+0x23
pool_get(c04318e0,0,1c9,c02d0f34,c0350a90) at pool_get+0x1c4
getnewvnode(0,0,c08fa400,c04a1d1c) at getnewvnode+0xbb
getdevvp(3,c04a1d98,3,c02bf76a,c0445d98) at getdevvp+0x2f
bdevvp(3,c04a1d98,c08fe800,ffffffff) at bdevvp+0x16
rf_find_raid_components(c0367a80,c0367a46,3,c01661b7,c03d6484,c042e240,c042e448,c03f7ec0,c03d647c,c042e240,c042e448,41727,640,640,c04a1fa0,c018ae42,4,c042e240,c04a1fa0,c018ae28,c049f010,49f000,4a6000,100007ff,c051,0,0,c0100331,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0) at rf_find_raid_components+0xda
raidattach(4,c042e240,c04a1fa0,c018ae28) at raidattach+0x2da
main(0,0,0,0,0) at main+0x396
Component on: wd0e: 2419200
   Row: 0 Column: 0 Num Rows: 1 Num Columns: 2
   Version: 2 Serial Number: 2000091201 Mod Counter: 16122754
   Clean: Yes Status: 0
   sectPerSU: 128 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 1  blocksize: 512 numBlocks: 2419072
   Autoconfig: Yes
   Contains root partition: Yes
   Last configured as: raid0
[... other components deleted ...]
Component on: wd2e: 2419200
   Row: 0 Column: 1 Num Rows: 1 Num Columns: 2
   Version: 2 Serial Number: 2000091201 Mod Counter: 16122754
   Clean: Yes Status: 0
   sectPerSU: 128 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 1  blocksize: 512 numBlocks: 2419072
   Autoconfig: Yes
   Contains root partition: Yes
   Last configured as: raid0
[... other components deleted ...]
IPsec: Initialized Security Association Processing.
Found: wd0e at 0,0
Found: wd2e at 0,1
RAID autoconfigure
Configuring raid0:
RAIDFRAME: protectedSectors is 64

malloc with held simple_lock 0xc03f0068 CPU 0 ../../../../dev/raidframe/rf_driver.c:354
malloc(30,5d,0,c01a488f,c0965000) at malloc+0x46
rf_ConfigureMapModule(c03f0084,40,c0365940,165,c0965000) at rf_ConfigureMapModule+0x16
rf_Configure(c0965000,c0973000,c0903300,c04a1f28,c08fd710) at rf_Configure+0x1b9
rf_auto_config_set(c08fd710,c04a1f4c,c042e448,c03cad80) at rf_auto_config_set+0x11d
rf_buildroothack(c08fd710,0,14,0,ffffffff,c042e240,0,0,c042dd7c,c042e240,c04a1fa0,c018aee1,c049f010,49f000,4a6000,100007ff,c051,0,0,c0100331,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0) at rf_buildroothack+0x4e
kthread_run_deferred_queue(c049f010,49f000,4a6000,100007ff,c051) at kthread_run_deferred_queue+0x3b
main(0,0,0,0,0) at main+0x435

[... 367 similar blocks of 9 lines deleted.  They all referred to
 rf_driver.c:354, but did not all have the same backtrace.  ...]

Starting autoconfiguration of RAID set...
Looking for 0,0 in autoconfig
Found: wd0e at 0,0
Looking for 0,1 in autoconfig
Found: wd2e at 0,1
RAIDFRAME: Configure (RAID Level 1): total number of sectors is 2419072 (1181 MB)
RAIDFRAME(RAID Level 1): Using 6 floating recon bufs with no head sep limit

malloc with held simple_lock 0xc09836d4 CPU 0 ../../../../dev/raidframe/rf_psstatus.c:113
malloc(5c,5d,0,93a,c09650f4) at malloc+0x46
rf_ConfigurePSStatus(c09650f4,c0965000,c0973000,1c8,c0965000) at rf_ConfigurePSStatus+0x179
rf_Configure(c0965000,c0973000,c0903300,c04a1f28,c08fd710) at rf_Configure+0xa62
rf_auto_config_set(c08fd710,c04a1f4c,c042e448,c03cad80) at rf_auto_config_set+0x11d
rf_buildroothack(c08fd710,0,14,0,ffffffff,c042e240,0,0,c042dd7c,c042e240,c04a1fa0,c018aee1,c049f010,49f000,4a6000,100007ff,c051,0,0,c0100331,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0) at rf_buildroothack+0x4e
kthread_run_deferred_queue(c049f010,49f000,4a6000,100007ff,c051) at kthread_run_deferred_queue+0x3b
main(0,0,0,0,0) at main+0x435

[... 7 similar blocks of 9 lines deleted ...]

Found: wd2f at 0,0
Found: wd0f at 0,1
RAID autoconfigure
Configuring raid1:

[... hundreds of lines deleted.  They were similar to what happened 
     under "Configuring raid0"  ...]
     
boot device: raid0
root on raid0a dumps on raid0b
init: copying out flags `-s' 3
init: copying out path `/sbin/init' 11
Enter pathname of shell or RETURN for sh: 
We recommend creating a non-root account and using su(1) for root access.
# 

[... I pressed control-D to continue the boot to multiuser mode ...]
[... Many messages from /etc/rc deleted ...]

Building databases...
panic: kernel diagnostic assertion "obj == 0 || _simple_lock_held((&obj->vmobjlock))" failed: file "../../../../uvm/uvm_page.c", line 990

Stopped in pid 103 (kvm_mkdb) at        cpu_Debugger+0x4:       leave
db> t
cpu_Debugger(0,cface474,0,c02c08d9,2fe) at cpu_Debugger+0x4
panic(c03d3e40,c0396d6d,c03973a0,c0396f13,3de) at panic+0xad
__assert(c0396d6d,c0396f13,3de,c03973a0,0) at __assert+0x1f
uvm_pagealloc_strat(cface474,5000,0,0,0) at uvm_pagealloc_strat+0x9e
uvn_findpage(cface474,5000,0,cfad0c94,0,cfad0da8,8,c0395209) at uvn_findpage+0x1
0b
uvn_findpages(cface474,5000,0,cfad0c7c,cfad0c94) at uvn_findpages+0xa8
genfs_getpages(cfad0db8,4,1,0,0) at genfs_getpages+0x434
ffs_getpages(cfad0db8,3,cface54c,c0378e60,cfad0eac) at ffs_getpages+0x98
layer_bypass(cfad0db8,2000,cface54c,0,c03dcf6c) at layer_bypass+0xdf
VOP_GETPAGES(cface54c,5000,0,cfad0ea8,cfad0eac) at VOP_GETPAGES+0x5e
uvn_get(cface54c,5000,0,cfad0ea8,cfad0eac,0,1,0,2,c039525f,397,3bcadce0) at uvn_
get+0x2b
uvm_fault(cf5c0834,48050000,0,1,0) at uvm_fault+0x1084
trap() at trap+0x51f
--- trap (number 6) ---
0x48050478:
db> 

>Fix:
I suspect that raidframe should be using lockmgr() instead of
simple_lock().  Most of raidframe's locking is encapsulated in a few
macros in rf_threadstuff.h, but there are also a few ltsleep() calls
that are not encapsulated by macros.
>Release-Note:
>Audit-Trail:
>Unformatted: