Subject: Re: Finding lost components in a raid set?
To: Johan Ihren <johani@autonomica.se>
From: Greg Oster <oster@cs.usask.ca>
List: current-users
Date: 02/07/2002 14:50:48

Johan Ihren writes:
> I've had lots of fun playing with RAIDframe lately and quite soon I
> switched to autoconfigured raid devices to "protect" the device from
> component renumbering.
> 
> Since Murphy is the one that is really in control, a Promise IDE
> controller card just failed. It took several attempted reboots, card
> rearrangements and cable replacements to make exactly sure that it
> indeed was the controller card that was at fault.
> 
> The Promise card is now removed, the disks rearranged on the remaining
> IDE channels and I want to get my raid devices back (two RAID5
> devices, both 2+1 with raid0 being a small one for experiments and
> raid1 being 120GB with 100+GB live data).
> 
> Unfortunately "raidctl -s" reports one component as completely missing
> (for both sets):
> 
> Components:
>         component0: failed
>          /dev/wd0a: optimal
>          /dev/wd1a: optimal
> 
> This is *not* because the disk failed. The disks are fine, brand new
> and no problems. I fixed my small raid0 device according to the manual:
> 
> raidctl -a /dev/wd2a raid0      (i.e. add wd2a *again*, since it was lost)
> raidctl -F component0 raid0     
> 
> For raid1 I tried to be more clever (since it has a significant amount
> of data on it), so I switched off autoconfig first, reordered the
> disks in /etc/raid1.conf and rebooted in the hope that the "raidctl -c
> /etc/raid1.conf raid1" during boot would find the missing component
> even though autoconfig didn't. Didn't work. I ended up with:
> 
> Components:
>         /dev/wd2g: failed
>         /dev/wd0g: optimal
>         /dev/wd1g: optimal
> 
> Here I gave up and initiated a re-construction of my raid1 with a
> "raidctl -R /dev/wd2g raid1" that will take about two hours to
> complete.
> 
> Basically it seems that everything has worked very nicely and I'm
> really pleased.
> 
> But, since I have no indication whatsoever that the disks should have
> failed in any way, I cannot help being a bit curious about *both* raid
> sets losing a component on the *same* disk without the disk being bad.

I don't suppose you have a copy of the disklabel for that disk, do you?
The autoconfig should have picked up wd2a if it had a disklabel type of 
FS_RAID (unless the component label got corrupted).  'raidctl -c' should have 
found a valid component label for wd2g -- the fact that it's marked as failed 
indicates that it likely didn't.  

> I assume that the autoconfigured raid devices keep their configs in
> the component labels 

The configuration info is in the component labels, yes.

> or thereabouts and I don't understand how both of
> them can have been trashed at the same time. 

I don't get that either... Was wd2 on the controller card that failed?  Maybe 
it managed to zero the component labels on you??  Seems unlikely, but that's 
the only explanation I can think of so far... 

> And if the on-disk
> configs were "bad", why didn't a "raidctl -c" fix that, if the media
> is ok?

"raidctl -c" uses the component labels to verify that you have the components 
listed in the right order... If a component label is missing (or badly 
out-of-sync with the remaining components), then that component will be marked
as failed.  If you managed to get components in the wrong order, the RAID set 
shouldn't even configure.  (It tries quite hard to make sure you don't mess up 
the ordering, but it needs valid component labels in order to do that. :) )
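
As a rough illustration of that check, here is a simplified Python model. It 
is not the actual RAIDframe code: the Label fields (serial, mod_counter, 
column) and the function check_components are names assumed for this sketch, 
and the real component label carries more information than this.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Label:
    serial: int       # assumed: identifies which RAID set the component belongs to
    mod_counter: int  # assumed: increases as the set is updated
    column: int       # assumed: position of the component within the set


def check_components(labels: List[Optional[Label]]) -> List[str]:
    """Return one status per configured position.

    labels[i] is the label read from the i-th component listed in the
    config file, or None if no valid label could be read.
    Raises ValueError if an in-sync component sits in the wrong position.
    """
    valid = [label for label in labels if label is not None]
    if not valid:
        raise ValueError("no valid component labels found")

    # Treat the most recently updated label as the reference for the set.
    reference = max(valid, key=lambda label: label.mod_counter)

    status = []
    for position, label in enumerate(labels):
        if label is None:
            # Missing label: the component is marked as failed.
            status.append("failed")
        elif (label.serial != reference.serial
              or label.mod_counter != reference.mod_counter):
            # Out of sync with the rest of the set: also marked as failed.
            status.append("failed")
        elif label.column != position:
            # A good label in the wrong slot: refuse to configure at all.
            raise ValueError("component order does not match the labels")
        else:
            status.append("optimal")
    return status


# One component with no readable label, two good ones.
print(check_components([None, Label(1, 10, 1), Label(1, 10, 2)]))
```

In this model a missing or stale label only costs you that one component, 
while two good components listed in swapped order stop the whole set from 
configuring.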

Later...

Greg Oster