current-users: Re: Finding lost components in a raid set?

Subject: Re: Finding lost components in a raid set?
To: Johan Ihren <johani@autonomica.se>
From: Greg Oster <oster@cs.usask.ca>
List: current-users
Date: 02/07/2002 16:37:16

Johan Ihren writes:
> Greg Oster <oster@cs.usask.ca> writes:
> 
> > Johan Ihren writes:
> > > Greg Oster <oster@cs.usask.ca> writes:
> > > 
> > > Hi Greg,
> > > 
> > > > > Basically it seems that everything has worked very nicely and I'm
> > > > > really pleased.
> > > 
> > > Unfortunately I spoke too soon. 
> > > 
> > > My re-construction of the "failed" component with a "raidctl -R
> > > /dev/wd2g raid1" just finished and raidctl -s now reports happiness.
> > > 
> > > But the disklabel for raid1 is zeroed.
> > 
> > :( Hmmm...
> > 
> > > How can that happen? I had two components out of three intact at all
> > > times for a 2+1 RAID5 device and I see no reason to lose the label.
> > 
> > The label should have been quite recoverable... in fact, it should have been
> > there even with just 2 components....
> > 
> > > I have to admit that I did *not* keep a copy of that label in a safe
> > > place, which in retrospect seems rather stupid.
> > 
> > Have a look in /var/backups :)
> 
> Whoever put disklabels under RCS control in /var/backups deserves
> eternal gratitude. Brilliant! Wonderful!

You'll find another thread (titled "*whew*") on current-users, which is 
exactly about this :)

> However that did not save the day in my particular case. I did indeed
> find the label and I know it was the right label, since I used
> nonstandard parameters to newfs:
> 
> 8 partitions:
> #        size    offset     fstype  [fsize bsize cpg/sgs]
>  d: 199999872         0     4.2BSD      0     0     0   # (Cyl.    0 - 260416*)
>  e: 199999872         0     4.2BSD   4096 32768   256   # (Cyl.    0 - 260416*)
> 
> However, life as we know it was no longer to be found on the raid1
> planet. This was once obviously a seriously damaged animal. And now it
> is dead:

:(

> 
> bash# fsck /dev/rraid1e
> ** /dev/rraid1e
> ** File system is clean; not checking
> bash# mount /dev/raid1e /usr/raid/raid1 
> bash# df | grep raid1e
> /dev/raid1e  99543824        4  94566628     0%    /usr/raid/raid1
> bash# ls /usr/raid/raid1 
> ls: /usr/raid/raid1: Bad file descriptor
> bash# file /usr/raid/raid1 
> /usr/raid/raid1: can't stat `/usr/raid/raid1' (Bad file descriptor).

Yuck.  I suspect doing 'fsck -f /dev/rraid1e' would yield a whole bunch of 
lossage....

> Time to start over. I've had better evenings.
> 
> Johan
> 
> PS. During my permutations of disks and IDE controllers when trying to
> isolate the hardware problem the other raid components were likely at
> some time located on the bad Promise controller. I wasn't mindful of
> that since I didn't know that the controller was bad and because of
> autoconfig it didn't matter that disks were renumbered. And my guess
> is that no disk was safe when attached to that controller.

Ya...  all they need to do is scribble stuff over the 'good' disks, and it 
doesn't matter how much RAID you have... :(  Once 2 disks in the RAID 5 set 
got scribbled on (even randomly), it's game over :(
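
To make the arithmetic behind that concrete, here is a minimal Python sketch 
of XOR parity on a single stripe of a 2+1 set. It is an illustration only, 
not RAIDframe's code: the helper xor_blocks and the sample block contents are 
made up for the example, and parity rotation and stripe layout are ignored.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


# One stripe of a hypothetical 2+1 set: two data blocks plus their parity.
d0 = b"disklabel"
d1 = b"superblck"
parity = xor_blocks(d0, d1)

# One component lost (d0): the surviving data block and the parity
# rebuild it exactly.
rebuilt_d0 = xor_blocks(d1, parity)

# Two components damaged: d0 is gone and d1 has been overwritten.
# The same reconstruction now runs without error but returns garbage.
scribbled_d1 = b"\x00" * len(d1)
garbage_d0 = xor_blocks(scribbled_d1, parity)

print(rebuilt_d0 == d0)   # single failure: recoverable
print(garbage_d0 == d0)   # second component scribbled: not recoverable
```

With one component missing, the XOR of the other two gives it back. Once a 
second one holds bad data, the parity equation has two unknowns, and the 
rebuild quietly produces wrong blocks instead of failing.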

Later...

Greg Oster