Subject: Re: RAIDFrame trouble with reconstructing disks
To: Matthias Buelow <mkb@mukappabeta.de>
From: Greg Oster <oster@cs.usask.ca>
List: netbsd-help
Date: 03/30/2001 23:16:05
Matthias Buelow writes:
> Hi folks,
> 
> After replacing a failing disk, I'm having serious trouble with
> RAIDframe.  The situation is as follows:
> 
> NetBSD 1.5/i386, 2x IBM DDRS 4.5 GB UW and 3x IBM DNES 9.1 GB UW
> disks.  The two DDRS (sd0e, sd1e) are configured as RAID1 (mirroring)
> and the 3 DNES (sd2e, sd3e, sd4e) as RAID5.  SCSI IDs start with
> 1 (not 0) and are assigned (and hard-coded in the kernel) to the
> devices in ascending order (sd0 is target 1, sd1 is target 2 etc.)
> As I understand it, the idea of the person who set it up that way
> was that in case of utter failure, one could still plug in a disk
> with ID 0 for easier rescue operations.
> The machine has the root fs on raid0 (the RAID1) and a mailspool
> on raid1 (the RAID5, no comments about mailspools on RAID5, please,
> the machine doesn't have any performance problems with that).
> Additionally, there are minimal installations in sd0a/sd0b (identical)
> for initial booting of the kernel and as an emergency system.
> 
> Now, sd1 (in the mirroring raid0) has failed (looks like some of
> the disk electronics went bozo) but it didn't affect general
> operation of the system apart from the occasional messages that
> target 2 timed out, followed by bus resets.  The system continued
> to work like normal, as expected with a redundant architecture.
> Tonight, I replaced the failed disk with a new one of the same
> sort, disklabelled it and reconstructed to the failed component
> with:
>  * un-autoconfigured raid0 and raid1 via "raidctl -A no raid{0,1}"
>    (worked; raid0 was autoconfigured with root before) so that the
>    system would use sd0a as the root fs, not raid0, and would not
>    autoconfigure raid1 either,
>  * reboot to single user mode,
>  * configured the raids with raidctl -c /etc/raid{0,1}.conf raid{0,1}
>    (worked),
>  * raidctl -R /dev/sd1e raid0 (reconstruction, worked),
>  * raidctl -P raid0 (rewrite of parity, worked),

You've run into a bug in 1.5 that is (supposed to be) fixed in -current
and in (what will be) 1.5.1.  After you do the "-R", do a

  raidctl -I 123456 raid0

and then you shouldn't experience the problem you're seeing.
(Basically, one of the fields in the component label doesn't get
initialized correctly on the reconstruct.  The "-I" sets that field
correctly, and running it again like this doesn't affect any data.)
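
In terms of your step list, that's just one extra command right after
the reconstruct.  As a sketch, the amended sequence would be (the
serial number is an arbitrary value of your choosing; use the same
one for the whole set):

  raidctl -R /dev/sd1e raid0   # reconstruct, as you did
  raidctl -I 123456 raid0      # then re-stamp the component labels
  raidctl -P raid0             # and rewrite parity, as you did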

>  * raid0 both components said "optimal", parity clean.  Good so far.
>    Because parity of raid1 was also dirty, I rewrote parity on raid1
>    as well, also successfully.
>  * autoconfigured raid0 with root for normal operation with
>    raidctl -A root raid0 (and autoconfigured raid1 w/o root as well),
>  * reboot.
>  * BOOM: raid0 comes up with component 0 (/dev/sd0e) "optimal";
>    however, component 1 (/dev/sd1e) isn't even displayed; instead,
>    raidctl -s says "component1: failed".  It doesn't even recognize
>    that component 1 is sd1e, although everything was OK before the
>    reboot!

Yup.. that's the typical symptom.
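
If you want to see the un-initialized label for yourself, you can dump
the component labels and compare the reconstructed component against
the surviving one.  Something like this should work (raidctl -g just
prints the component label for the named component, so it's safe to
run on a live set):

  raidctl -g /dev/sd0e raid0   # label of the surviving component
  raidctl -g /dev/sd1e raid0   # label of the reconstructed component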

[snip]
> Maybe someone also has encountered these problems or is generally
> more experienced with failing disks on RAIDFrame and could help me out?

"See above."  'raidctl -I 12345 raid0' after the 'raidctl -R' will fix things 
for you.  
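
For the state you're in right now (component1 marked failed after the
reboot), the same idea applies.  A sketch, assuming your existing
/etc/raid0.conf and that sd1e still carries the disklabel you gave it:

  raidctl -c /etc/raid0.conf raid0   # configure the set, if it isn't up
  raidctl -R /dev/sd1e raid0         # reconstruct to component 1 again
  raidctl -I 123456 raid0            # re-stamp the component labels
  raidctl -P raid0                   # rewrite parity
  raidctl -A root raid0              # turn root autoconfiguration back on
  raidctl -s raid0                   # both components should show "optimal"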

[BTW: Nice to see RAIDframe in use on a production box... and sorry about 
the bug :-/ ]

Later...

Greg Oster