Subject: Re: RAIDframe crash
To: Chris Jones <chris@cjones.org>
From: Greg Oster <oster@cs.usask.ca>
List: current-users
Date: 05/08/2001 17:20:58
Chris Jones writes:
> Three problems:
> 
> 1. Panic, apparently having something to do with RAIDframe.
> 
> 2. How to recover the RAID data after said panic.
> 
> 3. I can't get savecore to do its thing and give me a core dump.
> 
> Problem 1:  Here's partial /var/log/messages output.  At the time of
> the crash, I was copying a bunch of data onto a partition on raid1
> (RAID 5) from an NFS mount.
> 
> May  7 16:02:01 gamera /netbsd: NetBSD 1.5.1_ALPHA (GAMERA) #0: Tue Jan 23 11:43:12 MST 2001
> May  7 16:02:01 gamera /netbsd:     chris@legolas.mt.sri.com:/usr/local/tarpit/netbsd/syssrc/sys/arch/i386/compile/GAMERA
> May  7 16:02:02 gamera /netbsd: fxp0 at pci0 dev 10 function 0: Intel i82557 Ethernet, rev 8
> May  7 16:02:02 gamera /netbsd: fxp0: interrupting at irq 9
> May  7 16:02:02 gamera /netbsd: fxp0: Ethernet address 00:90:27:87:b0:7e, 10/100 Mb/s
> May  7 16:02:02 gamera /netbsd: inphy0 at fxp0 phy 1: i82555 10/100 media interface, rev. 4
> May  7 16:02:02 gamera /netbsd: inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
> May  7 16:02:02 gamera /netbsd: siop1 at pci0 dev 11 function 0: Symbios Logic 53c875 (ultra-wide scsi)
> May  7 16:02:02 gamera /netbsd: siop1: using on-board RAM
> May  7 16:02:02 gamera /netbsd: siop1: interrupting at irq 10
> May  7 16:02:02 gamera /netbsd: scsibus1 at siop1: 16 targets, 8 luns per target
> May  7 16:02:02 gamera /netbsd: siop0 at pci0 dev 14 function 0: Symbios Logic 53c875j (ultra-wide scsi)
> May  7 16:02:02 gamera /netbsd: siop0: using on-board RAM
> May  7 16:02:02 gamera /netbsd: siop0: interrupting at irq 9
> May  7 16:02:02 gamera /netbsd: scsibus0 at siop0: 16 targets, 8 luns per target
> May  7 16:02:03 gamera /netbsd: scsibus1: waiting 2 seconds for devices to settle...
> May  7 16:02:03 gamera /netbsd: siop1: target 0 using 8bit transfers
> May  7 16:02:03 gamera /netbsd: siop1: target 0 now synchronous at 20.0Mhz, offset 16
> May  7 16:02:03 gamera /netbsd: siop1: target 0 using tagged queuing
> May  7 16:02:03 gamera /netbsd: sd2 at scsibus1 target 0 lun 0: <IBM, DMVS36D, 0210> SCSI3 0/direct fixed
> May  7 16:02:03 gamera /netbsd: siop1: target 0 using 16bit transfers
> May  7 16:02:03 gamera /netbsd: siop1: target 0 now synchronous at 20.0Mhz, offset 16
> May  7 16:02:03 gamera /netbsd: sd2: 35003 MB, 11739 cyl, 20 head, 305 sec, 512 bytes/sect x 71687340 sectors
> May  7 16:02:03 gamera /netbsd: siop1: target 1 using 8bit transfers
> May  7 16:02:03 gamera /netbsd: siop1: target 1 now synchronous at 20.0Mhz, offset 16
> May  7 16:02:03 gamera /netbsd: siop1: target 1 using tagged queuing
> May  7 16:02:03 gamera /netbsd: sd3 at scsibus1 target 1 lun 0: <IBM, DMVS36D, 0210> SCSI3 0/direct fixed
> May  7 16:02:03 gamera /netbsd: siop1: target 1 using 16bit transfers
> May  7 16:02:03 gamera /netbsd: siop1: target 1 now synchronous at 20.0Mhz, offset 16
> May  7 16:02:03 gamera /netbsd: sd3: 35003 MB, 11739 cyl, 20 head, 305 sec, 512 bytes/sect x 71687340 sectors
> May  7 16:02:03 gamera /netbsd: st0 at scsibus1 target 5 lun 0: <HP, C1557A, U812> SCSI2 1/sequential removable
> May  7 16:02:03 gamera /netbsd: st0: siop1: target 5 now synchronous at 10.0Mhz, offset 16
> May  7 16:02:03 gamera /netbsd: density code 37, variable blocks, write-enabled
> May  7 16:02:04 gamera /netbsd: ch0 at scsibus1 target 5 lun 1: <HP, C1557A, U812> SCSI2 8/changer removable
> May  7 16:02:04 gamera /netbsd: ch0: 6 slots, 1 drive, 0 pickers, 0 portals
> May  7 16:02:04 gamera /netbsd: scsibus0: waiting 2 seconds for devices to settle...
> May  7 16:02:04 gamera /netbsd: siop0: target 0 using tagged queuing
> May  7 16:02:04 gamera /netbsd: sd0 at scsibus0 target 0 lun 0: <SEAGATE, ST34572W, 0784> SCSI2 0/direct fixed
> May  7 16:02:04 gamera /netbsd: siop0: target 0 using 16bit transfers
> May  7 16:02:04 gamera /netbsd: siop0: target 0 now synchronous at 20.0Mhz, offset 15
> May  7 16:02:04 gamera /netbsd: sd0: 4340 MB, 6300 cyl, 8 head, 176 sec, 512 bytes/sect x 8888924 sectors
> May  7 16:02:04 gamera /netbsd: siop0: target 1 using tagged queuing
> May  7 16:02:04 gamera /netbsd: sd1 at scsibus0 target 1 lun 0: <SEAGATE, ST32272W, 0784> SCSI2 0/direct fixed
> May  7 16:02:04 gamera /netbsd: siop0: target 1 using 16bit transfers
> May  7 16:02:04 gamera /netbsd: siop0: target 1 now synchronous at 20.0Mhz, offset 15
> May  7 16:02:04 gamera /netbsd: sd1: 2157 MB, 6300 cyl, 4 head, 175 sec, 512 bytes/sect x 4419464 sectors

The date jumps here... is there data missing?  If so, at least some of it is 
critical to solving this... 

> May  8 16:10:46 gamera /netbsd: sd2(siop1:0:0): command timeout

"Uh oh.."  This will not make RAIDframe happy if the IO didn't complete... but 
no big deal... RAID 5 will deal with a single disk failure...  I'm guessing 
that sd2e, sd3e, and sd4e are all in this RAID 5 set... but where is sd4?  
If it's not there, then you've already had 2 failures, which is fatal in a 
RAID 5 set...
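
For anyone following along, the "one failure is fine, two is fatal" arithmetic can be shown with a toy Python sketch (purely illustrative, not RAIDframe code): RAID 5 keeps one XOR parity block per stripe, so a single missing component is the XOR of everything that survives, but a second missing component leaves one equation with two unknowns.

```python
# Toy illustration (not RAIDframe code): RAID 5 stores XOR parity, so any
# ONE missing component per stripe can be rebuilt; two cannot.

def parity(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

# Hypothetical stripe across three data components (think sd2e, sd3e, sd4e).
data = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(data)                      # parity block written for the stripe

# One component lost: XOR of the survivors plus parity recovers it exactly.
rebuilt = parity([data[0], data[1], p])
assert rebuilt == data[2]

# Two components lost: parity gives only their combined XOR, not either one,
# so the data is unrecoverable -- which is where RAIDframe gives up.
```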

> May  8 16:10:46 gamera /netbsd: siop1: scsi bus reset
> May  8 16:10:46 gamera /netbsd: cmd 0xc0670240 (target 0:0) in reset list
> May  8 16:10:46 gamera /netbsd: cmd 0xc0670100 (target 0:0) in reset list
> May  8 16:10:46 gamera /netbsd: cmd 0xc06701c0 (target 0:0) in reset list
> May  8 16:10:46 gamera /netbsd: cmd 0xc0670300 (target 0:0) in reset list
> May  8 16:10:46 gamera /netbsd: cmd 0xc0670080 (target 1:0) in reset list
> May  8 16:10:46 gamera /netbsd: cmd 0xc06702c0 (target 1:0) in reset list
> May  8 16:10:46 gamera /netbsd: cmd 0xc0670000 (target 1:0) in reset list
> May  8 16:10:46 gamera /netbsd: cmd 0xc0670280 (target 1:0) in reset list
> May  8 16:10:46 gamera /netbsd: cmd 0xc0670240 (status 2) about to be processed
> May  8 16:10:46 gamera /netbsd: cmd 0xc0670100 (status 2) about to be processed
> May  8 16:10:46 gamera /netbsd: cmd 0xc06701c0 (status 2) about to be processed
> May  8 16:10:46 gamera /netbsd: cmd 0xc0670300 (status 2) about to be processed
> May  8 16:10:46 gamera /netbsd: cmd 0xc0670080 (status 2) about to be processed
> May  8 16:10:46 gamera /netbsd: cmd 0xc06702c0 (status 2) about to be processed
> May  8 16:10:46 gamera /netbsd: cmd 0xc0670000 (status 2) about to be processed
> May  8 16:10:46 gamera /netbsd: cmd 0xc0670280 (status 2) about to be processed
> May  8 16:10:46 gamera /netbsd: siop1: target 0 using 16bit transfers
> May  8 16:10:46 gamera /netbsd: siop1: target 0 now synchronous at 20.0Mhz, offset 16
> May  8 16:10:46 gamera /netbsd: siop1: target 1 using 16bit transfers
> May  8 16:10:46 gamera /netbsd: siop1: target 1 now synchronous at 20.0Mhz, offset 16
> May  8 16:10:47 gamera /netbsd: siop1: unexpected phase mismatch 6
> May  8 16:10:47 gamera /netbsd: sd3(siop1:1:0): parity error

"Uh Oh#2".  If sd3 is in the RAID set, RAIDframe is going to be really upset, 
as with 2 (or is it 3 now?) disks gone, it's pretty much game over.  
(And RAIDframe will turn out the lights, rather than continuing on...) 
(It should probably just stop doing IO's to the RAID set, but that's a 
different problem).

> May  8 16:10:47 gamera /netbsd: siop1: scsi bus reset
> May  8 16:10:47 gamera /netbsd: cmd 0xc06700c0 (target 0:0) in reset list
> 
> ...and then it crashed.  The console had some message about RAIDframe
> being unable to allocate a DAG.  I didn't write it down or get a
> backtrace, because I knew it would make a core dump.  :-/

Writing it down would not have caused a core dump, and would have helped 
confirm what I suspect happened.  Basically, when 2 disks in a RAID 5 set 
fail, RAIDframe gives up.  And by the looks of it, the machine had some 
serious SCSI problems: errors were returned to RAIDframe, RAIDframe marked the 
components as failed, and when more than one component failed, RAIDframe said 
"enough".
 
> Problem 2:  I'd like to get raid1 back up again, but it won't
> configure:
> May  8 16:38:31 gamera /netbsd: raidlookup on device: /dev/sd4e failed!
> May  8 16:38:31 gamera /netbsd: Hosed component: /dev/sd4e

So sd4 is no longer on the system?  Oh... looks like it wasn't there 
before on May 7???  (Unless the logs you have here are incomplete...)

> May  8 16:38:31 gamera /netbsd: raid1: Too many different mod counters!
> May  8 16:38:31 gamera /netbsd: raid1: Component /dev/sd2e being configured at row: 0 col: 0
> May  8 16:38:31 gamera /netbsd:          Row: 0 Column: 0 Num Rows: 1 Num Columns: 3
> May  8 16:38:31 gamera /netbsd:          Version: 2 Serial Number: 2001050700 Mod Counter: 35
> May  8 16:38:31 gamera /netbsd:          Clean: No Status: 0
> May  8 16:38:31 gamera /netbsd: /dev/sd2e is not clean!
> May  8 16:38:31 gamera /netbsd: raid1: Component /dev/sd3e being configured at row: 0 col: 1
> May  8 16:38:31 gamera /netbsd:          Row: 0 Column: 1 Num Rows: 1 Num Columns: 3
> May  8 16:38:31 gamera /netbsd:          Version: 2 Serial Number: 2001050700 Mod Counter: 36
> May  8 16:38:31 gamera /netbsd:          Clean: No Status: 0
> May  8 16:38:31 gamera /netbsd: /dev/sd3e has a different modfication count: 35 36
> May  8 16:38:31 gamera /netbsd: /dev/sd3e is not clean!
> May  8 16:38:31 gamera /netbsd: raid1: Ignoring /dev/sd4e
> May  8 16:38:31 gamera /netbsd: raid1: There were fatal errors
> May  8 16:38:31 gamera /netbsd: Closing vnode for row: 0 col: 0
> May  8 16:38:31 gamera /netbsd: Closing vnode for row: 0 col: 1
> May  8 16:38:31 gamera /netbsd: Closing vnode for row: 0 col: 2
> May  8 16:38:31 gamera /netbsd: vnode was NULL
> May  8 16:38:31 gamera /netbsd: RAIDFRAME: failed rf_ConfigureDisks with 22
> May  8 16:38:31 gamera /netbsd: Closing vnode for row: 0 col: 0
> May  8 16:38:31 gamera /netbsd: vnode was NULL
> May  8 16:38:31 gamera /netbsd: Closing vnode for row: 0 col: 1
> May  8 16:38:31 gamera /netbsd: vnode was NULL
> May  8 16:38:31 gamera /netbsd: Closing vnode for row: 0 col: 2
> May  8 16:38:31 gamera /netbsd: vnode was NULL
> 
> Now I could certainly force it to configure, in spite of the different
> mod counters, but I'd rather wait on advice from persons more
> knowledgeable.

About all you can do at this point is force it to configure, and hope for the 
best.  With sd4 unavailable, you should still be able to get sd2 and sd3 up in 
degraded mode.  If sd4 really was on-line before, what would have happened here 
is:
 1) sd2 had an error, and so would have been marked as "failed".  At that time 
the component labels for sd3 and sd4 would be updated.  (The component label 
for sd2 would not be touched.)

 2) sd3 then had an error, and so it would have been marked as "failed" too.
I'd need to double-check the code, but I don't think it would even get to the 
point of writing out component labels -- RAIDframe would just stop.

In that case, you really want sd3 and sd4 in the set, not sd2 and sd3, as sd3 
and sd4 would have the most up-to-date data.  If, however, sd4 was *not* in 
the system and you were running in degraded mode from the get-go, then you can 
just use sd2 and sd3, and things should be reasonably ok.  (There will likely 
be some filesystem lossage, but hopefully not much.)  Once we know whether sd4 
was there or not we'll have a better idea of what components you want to 
forcibly configure...
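
The "which components are freshest" rule above can be sketched in a few lines of Python (a hypothetical helper, not RAIDframe source).  The counters for sd2e and sd3e are the ones from your log; sd4e's value of 36 is an *assumption*, following the reasoning that its label would have been updated along with sd3e's when sd2 failed:

```python
# Hedged sketch of the selection rule, not RAIDframe source: when component
# labels disagree, the components sharing the highest mod counter saw the most
# recent label update, so they hold the most up-to-date view of the set.

def freshest(components):
    """components: dict of device name -> mod counter from its label."""
    newest = max(components.values())
    return sorted(name for name, c in components.items() if c == newest)

# sd2e stopped at 35 when it was failed out; sd3e reached 36, and sd4e is
# ASSUMED to be 36 as well (its label would have been updated with sd3e's).
labels = {"/dev/sd2e": 35, "/dev/sd3e": 36, "/dev/sd4e": 36}
print(freshest(labels))   # -> ['/dev/sd3e', '/dev/sd4e']
```

Which is exactly why, *if* sd4 was really there, sd3 and sd4 are the pair to keep and sd2 is the one to treat as stale.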

> Problem 3:  After reboot, I didn't get a core dump.  

"Dunno."

> Any help in solving any of these three problems would be much
> appreciated, and might even contribute to the other two.  :)

See above :)

Later...

Greg Oster