Subject: RAIDFrame trouble with reconstructing disks
To: None <netbsd-help@netbsd.org>
From: Matthias Buelow <mkb@mukappabeta.de>
List: netbsd-help
Date: 03/31/2001 05:50:01
Hi folks,

After replacing a failing disk I am having serious trouble with
RAIDframe.  The situation is as follows:

NetBSD 1.5/i386, 2x IBM DDRS 4.5 GB UW and 3x IBM DNES 9.1 GB UW
disks.  The two DDRS (sd0e, sd1e) are configured as RAID1 (mirroring)
and the 3 DNES (sd2e, sd3e, sd4e) as RAID5.  SCSI IDs start at
1 (not 0) and are assigned (and hard-coded in the kernel) to the
devices in ascending order (sd0 is target 1, sd1 is target 2, etc.).
As I understand it, the idea of the person who set it up this way
was that, in case of total failure, one could still plug in a disk
with ID 0 for easier rescue operations.
The machine has its root fs on raid0 (the RAID1) and a mail spool
on raid1 (the RAID5; no comments about mail spools on RAID5, please,
the machine has no performance problems with that).  Additionally,
there are minimal installations in sd0a/sd0b (identical) for initial
booting of the kernel and as an emergency system.
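
Roughly, that layout corresponds to an fstab along these lines (the
partition letters inside the RAID sets and the exact mail spool mount
point are from memory, so they may be slightly off):

    /dev/raid0a   /           ffs   rw   1 1
    /dev/raid1e   /var/mail   ffs   rw   1 2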

Now, sd1 (in the mirrored raid0) has failed (it looks like some of
the disk electronics went bozo), but this didn't affect general
operation of the system apart from occasional messages that
target 2 timed out, followed by bus resets.  The system continued
to work as normal, as expected with a redundant architecture.
Tonight I replaced the failed disk with a new one of the same
type, disklabelled it, and reconstructed the failed component as
follows:
 * turned off autoconfiguration via "raidctl -A no raid{0,1}" (worked;
   raid0 had been autoconfigured with root before) so that the
   system would use sd0a as the root fs, not raid0, and would not
   autoconfigure raid1 either,
 * rebooted to single user mode,
 * configured the raids with raidctl -c /etc/raid{0,1}.conf raid{0,1}
   (worked),
 * raidctl -R /dev/sd1e raid0 (reconstruction, worked),
 * raidctl -P raid0 (rewrite of parity, worked),
 * both raid0 components then showed "optimal", parity clean.  Good so
   far.  Because the parity of raid1 was also dirty, I rewrote parity
   on raid1 as well, also successfully.
 * re-enabled autoconfiguration of raid0 with root for normal operation
   via raidctl -A root raid0 (and autoconfigured raid1, without root,
   as well),
 * rebooted.
 * BOOM: raid0 comes up with component 0 (/dev/sd0e) "optimal",
   but component 1 (/dev/sd1e) isn't even displayed; instead
   raidctl -s says "component1: failed".  It doesn't even recognize
   that component 1 is sd1e, although everything was fine before the
   reboot!
 * raid1's parity is DIRTY, although it was clean before the reboot
   (I simply rebooted from single user with the "reboot" command).
 * I have attempted this several times now, finally got fed up and
   let the machine boot through to multiuser, where it started
   rewriting the parity of raid1, which has since completed (and is
   clean).  raid0 still shows "component1: failed" and its parity is
   DIRTY.  An attempt to rewrite parity on raid0 just says
   "raid0: Error re-writing parity!" and that's it.

Now I'm stuck.  The system is a production mail server and I'd
prefer it to be working correctly again, although mail operation is
fine right now.
Have I done the proper steps in the right order, or should something
else be done here?
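
For clarity, here is the whole procedure condensed into the commands
I actually ran (paraphrased from memory; device names as above):

    raidctl -A no raid0
    raidctl -A no raid1
    # rebooted to single user, root on sd0a; new sd1 already labelled
    raidctl -c /etc/raid0.conf raid0
    raidctl -c /etc/raid1.conf raid1
    raidctl -R /dev/sd1e raid0
    raidctl -P raid0
    raidctl -P raid1
    raidctl -A root raid0
    raidctl -A yes raid1
    reboot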

Here's the relevant data:

***** dmesg output for disks:

sd0 at scsibus0 target 1 lun 0: <IBM, DDRS-34560D, DC1B> SCSI2 0/direct fixed
siop0: target 1 using 16bit transfers
siop0: target 1 now synchronous at 20.0Mhz, offset 15
sd0: 4357 MB, 8387 cyl, 5 head, 212 sec, 512 bytes/sect x 8925000 sectors
sd1 at scsibus0 target 2 lun 0: <IBM, DDRS-34560W, S71D> SCSI2 0/direct fixed
siop0: target 2 using 16bit transfers
siop0: target 2 now synchronous at 20.0Mhz, offset 15
sd1: 4357 MB, 8387 cyl, 5 head, 212 sec, 512 bytes/sect x 8925000 sectors
sd2 at scsibus0 target 3 lun 0: <IBM, DNES-309170W, SA30> SCSI3 0/direct fixed
siop0: target 3 using 16bit transfers
siop0: target 3 now synchronous at 20.0Mhz, offset 16
sd2: 8748 MB, 11474 cyl, 5 head, 312 sec, 512 bytes/sect x 17916240 sectors
sd3 at scsibus0 target 4 lun 0: <IBM, DNES-309170W, SA30> SCSI3 0/direct fixed
siop0: target 4 using 16bit transfers
siop0: target 4 now synchronous at 20.0Mhz, offset 16
sd3: 8748 MB, 11474 cyl, 5 head, 312 sec, 512 bytes/sect x 17916240 sectors
sd4 at scsibus0 target 5 lun 0: <IBM, DNES-309170W, SA30> SCSI3 0/direct fixed
siop0: target 5 using 16bit transfers
siop0: target 5 now synchronous at 20.0Mhz, offset 16
sd4: 8748 MB, 11474 cyl, 5 head, 312 sec, 512 bytes/sect x 17916240 sectors

***** raid0.conf:

START array
1 2 0

START disks
/dev/sd0e
/dev/sd1e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
128 1 1 1

START queue
fifo 100

***** raid1.conf:

START array
1 3 1

START disks
/dev/sd2e
/dev/sd3e
/dev/sd4e

#START spare
#/dev/sd5e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
32 1 1 5

START queue
fifo 100

***** raidctl -s raid0 output:

Components:
           /dev/sd0e: optimal
          component1: failed
No spares.
Component label for /dev/sd0e:
   Row: 0 Column: 0 Num Rows: 1 Num Columns: 2
   Version: 2 Serial Number: 273645 Mod Counter: 1514107706
   Clean: No Status: 0
   sectPerSU: 128 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 1  blocksize: 512 numBlocks: 7974016
   Autoconfig: Yes
   Root partition: Yes
   Last configured as: raid0
component1 status is: failed.  Skipping label.
Parity status: DIRTY
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.

***** raidctl -s raid1 output:

Components:
           /dev/sd2e: optimal
           /dev/sd3e: optimal
           /dev/sd4e: optimal
No spares.
Component label for /dev/sd2e:
   Row: 0 Column: 0 Num Rows: 1 Num Columns: 3
   Version: 2 Serial Number: 314159 Mod Counter: 436
   Clean: No Status: 0
   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 5  blocksize: 512 numBlocks: 17916160
   Autoconfig: Yes
   Root partition: No
   Last configured as: raid1
Component label for /dev/sd3e:
   Row: 0 Column: 1 Num Rows: 1 Num Columns: 3
   Version: 2 Serial Number: 314159 Mod Counter: 436
   Clean: No Status: 0
   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 5  blocksize: 512 numBlocks: 17916160
   Autoconfig: Yes
   Root partition: No
   Last configured as: raid1
Component label for /dev/sd4e:
   Row: 0 Column: 2 Num Rows: 1 Num Columns: 3
   Version: 2 Serial Number: 314159 Mod Counter: 436
   Clean: No Status: 0
   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 5  blocksize: 512 numBlocks: 17916160
   Autoconfig: Yes
   Root partition: No
   Last configured as: raid1
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.

***** raidctl -P raid0 output:

/dev/raid0d: Parity status: DIRTY
/dev/raid0d: Initiating re-write of parity
Mar 31 05:44:13 hostname /netbsd: raid0: Error re-writing parity!
/dev/raid0d: Parity Re-write complete


The disks in question are accessible and don't show any errors.
The disklabels are OK; sd0 and sd1 are identical disks and have
the same label:

...
8 partitions:
#        size   offset     fstype   [fsize bsize   cpg]
  a:   819317       63     4.2BSD     1024  8192    16   # (Cyl.    0*- 772)
  b:   131440   819380       swap                        # (Cyl.  773 - 896)
  c:  8924937       63     unused        0     0         # (Cyl.    0*- 8419*)
  d:  8925000        0     unused        0     0         # (Cyl.    0 - 8419*)
  e:  7974180   950820       RAID                        # (Cyl.  897 - 8419*)
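
(The label on the new sd1 was written by copying sd0's label, roughly
as below; I'm quoting the exact disklabel invocation from memory:)

    disklabel sd0 > /tmp/sd.label
    disklabel -R -r sd1 /tmp/sd.label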

Maybe someone else has encountered these problems, or is generally
more experienced with failing disks under RAIDframe, and could help
me out?
Thanks.

mkb