Subject: Re: RAIDFrame trouble with reconstructing disks
To: Matthias Buelow <mkb@mukappabeta.de>
From: Manuel Bouyer <bouyer@antioche.lip6.fr>
List: netbsd-help
Date: 03/31/2001 17:42:29
On Sat, Mar 31, 2001 at 05:50:01AM +0200, Matthias Buelow wrote:
> Hi folks,
> 
> after I have replaced a failing disk I have serious trouble with
> raidframe.  The situation is as following:
> 
> NetBSD 1.5/i386, 2x IBM DDRS 4.5 GB UW and 3x IBM DNES 9.1 GB UW
> disks.  The two DDRS (sd0e, sd1e) are configured as RAID1 (mirroring)
> and the 3 DNES (sd2e, sd3e, sd4e) as RAID5.  SCSI IDs start with
> 1 (not 0) and are assigned (and hard-coded in the kernel) to the
> devices in ascending order (sd0 is target 1, sd1 is target 2 etc.)
> As I get it, the idea of the person who set it up that way was, that
> in case of utter failure, one could still plug in a disk with ID 0
> for easier rescue operations.
> The machine has the root fs on raid0 (the RAID1) and a mailspool
> on raid1 (the RAID5, no comments about mailspools on RAID5, please,
> the machine doesn't have any performance problems with that).
> Additionally, there're minimal installations in sd0a/sd0b (identical)
> for initial booting of the kernel and as an emergency system.
> 
> Now, sd1 (in the mirroring raid0) has failed (looks like some of
> the disk electronics went bozo) but it didn't affect general
> operation of the system apart from the occasional messages that
> target 2 timed out, followed by bus resets.  The system continued
> to work like normal, as expected with a redundant architecture.
> Tonight, I replaced the failed disk with a new one of the same
> sort, disklabelled it and reconstructed to the failed component
> with:
>  * un-autoconfigured raid0 via "raidctl -A no raid{0,1}" (worked)
>    (raid0 was autoconfigured with root before) to have the
>    system use sd0a as root fs, not raid0 and not autoconfigure raid1
>    either,
>  * reboot to single user mode,
>  * configured the raids with raidctl -c /etc/raid{0,1}.conf raid{0,1}
>    (worked),
>  * raidctl -R /dev/sd1e raid0 (reconstruction, worked),
>  * raidctl -P raid0 (rewrite of parity, worked),
>  * raid0 both components said "optimal", parity clean.  Good so far.
>    Because parity of raid1 was also dirty, I rewrote parity on raid1
>    aswell, also successfully.
>  * autoconfigured raid0 with root for normal operation with
>    raidctl -A root raid0 (and autoconfigured raid1 w/o root aswell),
>  * reboot.
>  * BOOM: raid0 comes up with component 0 (/dev/sd0e) "optimal",
>    however component 1 (/dev/sd1e) wasn't even displayed, instead
>    raidctl -s says "component1: failed".  It doesn't even recognize
>    that component 1 is sd1e, although everything was ok before the
>    reboot!

As Greg said, this is a bug in 1.5, which is fixed in -current and 1.5.1_ALPHA
(I ran into it as well); you can work around it with raidctl -I.

However, I'd like to add that you don't have to go through all these steps to
replace a disk. Once you've replaced the disk, reboot with root on raid0.
raid0 will come up with 'component1: failed'. You can then disklabel your
new disk and add it as a hot spare for raid0:
raidctl -a /dev/sd1e raid0
then reconstruct the failed component:
raidctl -F component1 raid0
Then you can rewrite parity (raidctl -P raid0).

Now raidctl -s will still show 'component1: failed', and will also report
'/dev/sd1e: used spare'. Modulo the bug which doesn't properly update
something in the raidframe header of the new disk (worked around with
raidctl -I), the next reboot will give you a raid0 in 'normal' state, with
both sd0e and sd1e marked as 'optimal'.
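
The replacement steps above can be collected into a small script. This is
only a sketch: the run and replace_component helpers and the DRY_RUN switch
are names made up for illustration, and the raidctl -S call (to look at
reconstruction progress before touching parity) is my assumption about a
sensible order, not something required by the procedure described here.
By default nothing is executed; the commands are only printed.

```shell
#!/bin/sh
# Dry-run by default: print the commands instead of executing them.
# Set DRY_RUN=0 to really run them (as root, on the affected machine).
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper wrapping the steps from the mail.
# $1 = RAID device, $2 = failed component, $3 = freshly disklabelled partition
replace_component() {
    run raidctl -a "$3" "$1"   # add the new partition as a hot spare
    run raidctl -F "$2" "$1"   # fail the component, reconstruct onto the spare
    run raidctl -S "$1"        # assumption: check reconstruction status first
    run raidctl -P "$1"        # rewrite parity if it is not clean
    run raidctl -s "$1"        # show the resulting component status
}

replace_component raid0 component1 /dev/sd1e
```

With DRY_RUN left at 1 the script just lists the raidctl invocations in
order, which is a cheap way to double-check device names before running
anything for real.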

This works fine, and it's how I install machines with root on raid-1 now:
do a quick install of NetBSD on sd0; use this one to set up the raid-1
on sd1 and a nonexistent sd2. The raid will configure but mark sd2 as failed
(expected, as it doesn't exist). However, that's enough to install NetBSD
on the raid-1 and mark it as autoconfigure,root. Then reboot on sd1;
the system will boot with root on raid0. raidctl -s will show raid0 with
component1 failed. Disklabel sd0 and use the above steps to reconstruct
component1 on sd0. Voila, you've got a system with root on raid-1 :)
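
For illustration, the first half of that install procedure might look
roughly like the sketch below. Several things here are assumptions of mine
and not taken from a real setup: the 'e' partition letters, the stripe
size in the layout section, the serial number passed to raidctl -I, the use
of raidctl -C to force the initial configuration, and the helper names
(run, raid0_conf, setup_degraded_mirror). Check raidctl(8) on your release
before using any of it. Like this, it only prints what it would do.

```shell
#!/bin/sh
# Dry-run by default: print the config and commands instead of acting.
DRY_RUN=${DRY_RUN:-1}
SERIAL=${SERIAL:-20010331}     # hypothetical component-label serial number

# Hypothetical helper: echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# RAID-1 config naming sd1e plus a disk that does not exist yet (sd2e).
raid0_conf() {
    cat <<'EOF'
START array
# numRow numCol numSpare
1 2 0

START disks
/dev/sd1e
/dev/sd2e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
128 1 1 1

START queue
fifo 100
EOF
}

setup_degraded_mirror() {
    if [ "$DRY_RUN" = 1 ]; then
        raid0_conf                      # just show the config
    else
        raid0_conf > /etc/raid0.conf
    fi
    run raidctl -C /etc/raid0.conf raid0   # force the first configuration
    run raidctl -I "$SERIAL" raid0         # write component labels
    # Disklabel raid0, create filesystems and install NetBSD on it here
    # (not shown), then mark the set as autoconfigured root:
    run raidctl -A root raid0
}

setup_degraded_mirror
```

After rebooting onto raid0, the remaining work is the hot-spare sequence
already given: disklabel sd0, raidctl -a /dev/sd0e raid0, then
raidctl -F component1 raid0.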

--
Manuel Bouyer <bouyer@antioche.eu.org>
--