Subject: RAIDframe crash again
To: None <current-users@netbsd.org>
From: Kazushi (Jam) Marukawa <jam@pobox.com>
List: current-users
Date: 07/12/2001 14:57:24
Hi,

My system crashed, and the situation is similar to the one
Chris Jones reported.  FYI, the message-id of his mail is
<20010508165041.C6074@mt.sri.com>.

The underlying cause is that two hard drives failed in a
4-drive RAID5 set, after which the system crashed.  Is there
any way to prevent this crash?  A copy of the messages is
below.  This is not the complete log; I just grepped for the
keyword "raid".

Jul 10 19:51:29 sou /netbsd: raid0: IO Error.  Marking /dev/wd3e as failed.
Jul 10 19:51:29 sou /netbsd: raid0: node (Rop) returned fail, rolling backward
Jul 10 19:51:29 sou /netbsd: raid0: DAG failure: w addr 0x6048eb0 (100961968) nblk 0x80 (128) buf 0xc4add000
Jul 10 20:20:01 sou /netbsd: raid0: IO Error.  Marking /dev/wd1e as failed.
Jul 10 20:20:01 sou /netbsd: raid0: node (Rrd) returned fail, rolling backward
Jul 10 20:20:01 sou /netbsd: raid0: DAG failure: w addr 0x623ee80 (103018112) nblk 0x8 (8) buf 0xc4add000
Jul 10 20:20:14 sou /netbsd: raid0: node (Rrd) returned fail, rolling backward
Jul 10 20:20:14 sou /netbsd: raid0: DAG failure: w addr 0x623ee88 (103018120) nblk 0x8 (8) buf 0xc4ade000
Jul 10 20:20:16 sou /netbsd: raid0: node (Rrd) returned fail, rolling backward
Jul 10 20:20:16 sou /netbsd: raid0: DAG failure: w addr 0x4a0 (1184) nblk 0x10 (16) buf 0xca885000
Jul 10 20:20:19 sou /netbsd: raid0: node (Rrd) returned fail, rolling backward
Jul 10 20:20:19 sou /netbsd: raid0: DAG failure: w addr 0x4b0 (1200) nblk 0x10 (16) buf 0xcb325000
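
For context on why the second "Marking ... as failed" line matters:
RAID5 keeps one parity block per stripe, so a 4-component set can
rebuild any single missing block but not two.  The sketch below is
only an illustration of that XOR arithmetic.  It is not RAIDframe
code, and the function names (xor_blocks, make_stripe, reconstruct)
are made up for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks plus one parity block (hypothetical layout:
    parity stored last; real RAID5 rotates parity across components)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe):
    """Rebuild a stripe in which failed components are given as None.

    One missing block is recovered by XOR-ing the survivors.
    Two or more missing blocks cannot be recovered from single parity.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("more than one component failed; data is lost")
    rebuilt = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        rebuilt[missing[0]] = xor_blocks(survivors)
    return rebuilt


if __name__ == "__main__":
    stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])  # 3 data + 1 parity

    one_failed = [stripe[0], None, stripe[2], stripe[3]]
    print(reconstruct(one_failed) == stripe)

    two_failed = [stripe[0], None, stripe[2], None]
    try:
        reconstruct(two_failed)
    except ValueError as err:
        print("unrecoverable:", err)
```

With one component missing the stripe is rebuilt from the other
three; with two missing there is nothing to rebuild from, so any
access to that stripe has to fail.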

Here is a trace taken after the crash.  I hope this helps a
developer fix the problem.

db> trace
cpu_Debugger(c0b5d048,c0b5d000,1,d40af8ec,c01f023c) at cpu_Debugger+0x4
panic(c05ddc80,1c7,c048b5c0,c048b6cd,0) at panic+0x8e
rf_State_CreateDAG(c0b5d000,c0b5d000,3fff10b9,c0b5d0e4,0) at rf_State_CreateDAG+0x18c
rf_ContinueRaidAccess(c0b5d000,c0b38000,c0d421a0,0,c0b5d000) at rf_ContinueRaidAccess+0x92
rf_DoAccess(c0b38000,72,1,f738280,0,10,0,c4add000,c0d421a0,0,0,8,0,0,0,c0b37820) at rf_DoAccess+0x254
raidstart(c0b38000) at raidstart+0x215
raidstrategy(c0d421a0,c0d421a0,d40afa08,c033b4c2,d40afa14) at raidstrategy+0x170
spec_strategy(d40afa14,2,c0d421a0,f738280,d3d1a1c0) at spec_strategy+0x47
ufs_strategy(d40afa14,c04903e0,c0d421a0,d40afb54,c02344d7) at ufs_strategy+0xae
VOP_STRATEGY(c0d421a0) at VOP_STRATEGY+0x28
genfs_getpages(d40afb6c,4000,0,d3dc7aa8,c0490fc0) at genfs_getpages+0xcb3
VOP_GETPAGES(d3dc7aa8,4000,0,d40afbc0,d40afc00,0,1,0,2) at VOP_GETPAGES+0x58
ubc_fault(d40afd1c,d377f000,d40afc9c,1,0,0,1,2) at ubc_fault+0x179
uvm_fault(c05c9080,d377f000,0,1,1000) at uvm_fault+0x635
trap() at trap+0x461
--- trap (number 6) ---
copyout(d377f000,1000,d40aff10,d3dc7a,4000) at copyout+0x98
ffs_read(d40afe98,0,c0490660,d3dc7aa8,d40aff10) at ffs_read+0x101
VOP_READ(d3dc7aa8,d40aff10,0,c0c79980,d3dc7aa8) at VOP_READ+0x38
vn_read(d3e6f4e8,d3e6f504,d40aff10,c0c79980,1) at vn_read+0x78
dofileread(d3e721e0,3,d3e6f4e8,805b000,1000) at dofileread+0x93
sys_read(d3e721e0,d40aff88,d40aff80) at sys_read+0x4e
syscall_plain(1f,1f,1,1000,bfbfd8f8) at syscall_plain+0x98
db> c
syncing disks... panic: lockmgr: locking against myself
Stopped in pid 24086 (cat) at   cpu_Debugger+0x4:       leave
db> c

dumping to dev 0,1 offset 696232
dump 255 254 253 252 251 250 249 248 247 246 245 244 243 242 241 240 239 238 237


Both hard drives that RAIDframe marked as failed pass the
manufacturer's test program.  They may be starting to go
bad, but they work for now.  So I connected only 3 of the 4
drives and started using them to make a backup.  I
configured the RAID5 set with -C and ran fsck.  fsck asked
me to remove some files to repair the file system.  I first
copied those files elsewhere, hoping that only the inodes
were corrupted and the data was intact.  After fsck, I
copied those files back to their original location, and the
system crashed again.  Sigh.  However, after restarting the
system and running fsck -p, I was able to copy those files
back to their original location.  Here is a trace from this
second crash.

db> trace
VOP_STRATEGY(c0c45248,c0bef84c,c0c35880,3fff10b9,31ad080) at VOP_STRATEGY+0x1f
rf_DispatchKernelIO(c0bef84c,c0c35880,c0c03960,c0c2ccc0,c0bef360) at rf_DispatchKernelIO+0x1ca
rf_DiskIOEnqueue(c0bef84c,c0c35880,1,c0bef360,0) at rf_DiskIOEnqueue+0x1cd
rf_DiskReadFuncForThreads(c0bef360,c0bef360,d406c618,c01ccf28,c0bef360) at rf_DiskReadFuncForThreads+0x13c
FireNode(c0bef360) at FireNode+0x4a
FireNodeList(c0bef360,c0bfb6c0,0,0,1) at FireNodeList+0x158
PropagateResults(c0bfb6c0,0,c0bfb6c0,d406c684,c01cd52c) at PropagateResults+0x324
ProcessNode(c0bfb6c0,0,d406c694,c01bf584,c0bfb6c0) at ProcessNode+0xbd
rf_FinishNode(c0bfb6c0,0,d406c6a4,c01ccbfe,c0bfb6c0) at rf_FinishNode+0x18
rf_NullNodeFunc(c0bfb6c0,c0bff690,d406c6bc,c01ccdbc,c0bfb6c0) at rf_NullNodeFunc+0x14
FireNode(c0bfb6c0) at FireNode+0x4a
FireNodeArray(1,c0bff690,0,c0d5a700,3fff10b9) at FireNodeArray+0x158
rf_DispatchDAG(c0bff680,c01ef758,c0d5a700) at rf_DispatchDAG+0xf1
rf_State_ExecuteDAG(c0c2f800,c0c2f800,3fff10b9,c0c2f8e4,0) at rf_State_ExecuteDAG+0x14f
rf_ContinueRaidAccess(c0c2f800,c0b3b000,c0c3ea28,0,c0c2f800) at rf_ContinueRaidAccess+0x9a
rf_DoAccess(c0b3b000,77,1,9507190,0,80,0,c4afd000,c0c3ea28,0,0,8,0,0,0,c0b19020) at rf_DoAccess+0x254
raidstart(c0b3b000) at raidstart+0x215
raidstrategy(c0c3ea28,c0c3ea28,d406c828,c033b4c2,d406c834) at raidstrategy+0x170
spec_strategy(d406c834,10,c0c3ea28,9507190,d3ec2730) at spec_strategy+0x47
ufs_strategy(d406c834,c04903e0,c0c3ea28,d406c8c8,c0234c3c) at ufs_strategy+0xae
VOP_STRATEGY(c0c3ea28) at VOP_STRATEGY+0x28
genfs_putpages(d406c8dc,d3ebdce8,1,c0491000,d3ebdce8) at genfs_putpages+0x328
VOP_PUTPAGES(d3ebdce8,d406c9bc,10,21,0) at VOP_PUTPAGES+0x3f
uvn_put(d3ebdce8,d406c9bc,10,21,0) at uvn_put+0x16
uvm_pager_put(d3ebdce8,c0a305f8,d406c9b4,d406c9b8,21) at uvm_pager_put+0xa4
uvn_flush(d3ebdce8,20000,0,30000,0,1) at uvn_flush+0x4b7
ffs_write(d406ce84,1,c04906a0,d3ebdce8,d406cf10) at ffs_write+0x34a
VOP_WRITE(d3ebdce8,d406cf10,1,c0ce9980,d3ebdce8) at VOP_WRITE+0x38
vn_write(d3f15b50,d3f15b6c,d406cf10,c0ce9980,1) at vn_write+0x9e
dofilewrite(d405c3d8,4,d3f15b50,805ad20,10000) at dofilewrite+0x94
sys_write(d405c3d8,d406cf88,d406cf80) at sys_write+0x4e
syscall_plain(1f,1f,4,0,bfbfcd10) at syscall_plain+0x98
db>

-- Kazushi