Subject: Re: Finding lost components in a raid set?
To: Johan Ihren <johani@autonomica.se>
From: Greg Oster <oster@cs.usask.ca>
List: current-users
Date: 02/07/2002 15:41:56
Johan Ihren writes:
> Greg Oster <oster@cs.usask.ca> writes:
> 
> Hi Greg,
> 
> > > Basically it seems that everything has worked very nicely and I'm
> > > really pleased.
> 
> Unfortunately I spoke too soon. 
> 
> My re-construction of the "failed" component with a "raidctl -R
> /dev/wd2g raid1" just finished and raidctl -s now reports happiness.
> 
> But the disklabel for raid1 is zeroed.

:( Hmmm...

> How can that happen? I had two components out of three intact at all
> times for a 2+1 RAID5 device and I see no reason to lose the label.

The label should have been quite recoverable... in fact, it should have been 
there even with just 2 components....
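
To illustrate why two of three components should be enough: in a 2+1 RAID 5 stripe, the parity block is the XOR of the two data blocks, so any single missing block can be rebuilt from the other two. The sketch below shows only that general parity idea. It is not RAIDframe's actual code, and the block contents and the helper name xor_blocks are made up for the example.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# One stripe of a hypothetical 2+1 set: two data blocks plus their parity.
d0 = b"disklabel bytes "
d1 = b"filesystem data "
parity = xor_blocks(d0, d1)

# Lose any single block and the remaining two are enough to rebuild it.
rebuilt_d0 = xor_blocks(d1, parity)
rebuilt_d1 = xor_blocks(d0, parity)
```

So whatever held the label on the failed component should have been reconstructible from the surviving pair, assuming those two were themselves intact.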

> I have to admit that I did *not* keep a copy of that label in a safe
> place, which in retrospect seems rather stupid.

Have a look in /var/backups :)

> But I regarded the
> underlying device as "safe", in the sense that an event that managed to
> wipe out the label would wipe out the file system data also.
>
> I also have to admit that I am less happy than 30 minutes ago ;-(

Ya.. no kidding :(  
 
> Losing the disklabel after what should be considered a standard
> replacement of a failed component is not encouraging. But I really
> don't see where I did anything wrong.

You didn't... at least as far as I can tell.... (the only way you could have
lost anything here is by using 'raidctl -C' and getting the order of the
components wrong...)

> > > But, since I have no indication whatsoever that the disks should have
> > > failed in any way, I cannot help being a bit curious about *both* raid
> > > sets losing a component on the *same* disk without the disk being bad.
> > 
> > I don't suppose you have a copy of the disklabel for that disk, do
> > you?  The autoconfig should have picked up wd2a if it had a
> > disklabel type of FS_RAID (unless the component label got
> > corrupted).  'raidctl -c' should have found a valid component label
> > for wd2g -- the fact that it's marked as failed indicates that it
> > likely didn't.
> 
> I have the label for wd2 (attached), I don't have the label for raid1,
> as I said above.
> 
> > > I assume that the autoconfigured raid devices keep their configs in
> > > the component labels 
> > 
> > The configuration info is in the component labels, yes.
> > 
> > > or thereabouts and I don't understand how both of
> > > them can have been trashed at the same time. 
> > 
> > I don't get that either... Was wd2 on the controller card that
> > failed?  Maybe it managed to zero the component labels on you??
> > Seems unlikely, but that's the only explanation I can think of so
> > far...
> 
> Yes, wd2 was on the controller that failed.

Hmmmmmm.

> > > And if the on-disk
> > > configs were "bad", why didn't a "raidctl -c" fix that, if the media
> > > is ok?
> > 
> > "raidctl -c" uses the component labels to verify that you have the
> > components listed in the right order... If a component label is
> > missing (or badly out-of-sync with the remaining components), then
> > that component will be marked as failed.  If you managed to get
> > components in the wrong order, the RAID set shouldn't even
> > configure.  (It tries quite hard to make sure you don't mess up the
> > ordering, but it needs valid component labels in order to do
> > that. :) )
> 
> I assume that this "component label" is stored adjacent to the device
> it describes. I.e. the component label for /dev/wd2g is located at the
> head of that physical partition?

Yes.. the first 32K is 'reserved'.  The component label lives 16K in from
the start of the partition.
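
If you want to check whether that area has been zeroed, something like the following rough sketch would do. It assumes you point it at the raw partition device (or at an image copied off it), and it only reads. The 512-byte probe length and the function name are my own assumptions for illustration, not the real size or layout of the component label.

```python
COMPONENT_LABEL_OFFSET = 16 * 1024  # label lives 16K in from the partition start
RESERVED_BYTES = 32 * 1024          # the first 32K of the partition is reserved


def label_region_is_zeroed(path: str, length: int = 512) -> bool:
    """Return True if `length` bytes at the label offset are all zero.

    `length` is an assumed probe size, not the actual label size.
    """
    with open(path, "rb") as f:
        f.seek(COMPONENT_LABEL_OFFSET)
        chunk = f.read(length)
    if len(chunk) < length:
        raise ValueError("file is too short to contain the label region")
    return not any(chunk)
```

A True result for wd2g would fit the theory that something wiped the component label. A False result only says there is some non-zero data there, not that the label is valid.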

> Here's the disklabel for wd2. It is definitely intact, since it is
> exactly the same as for wd0 and wd1:
snip.

Try re-labelling raid1 with the label from /var/backups.  Hopefully, however, 
whatever's been zeroing stuff on you won't have wrecked the filesystem too :(

Later...

Greg Oster