Subject: Re: Horrible RAIDFrame Crash
To: Greg Oster <oster@cs.usask.ca>
From: Caffeinate The World <mochaexpress@yahoo.com>
List: current-users
Date: 04/15/2003 08:33:23
--- Greg Oster <oster@cs.usask.ca> wrote:
> Caffeinate The World writes:
> > Alpha 1.6R CVS src from 4/13.
> > 
> > I was trying to build a raid1 set using sd0 and sd1. sd0 had
> > live data on it. I succeeded in creating a temporary set using
> > sd1 and a fake nonexistent first component.
> > 
> > sd1a raid0 /
> > sd1b raid1 swap
> > sd1d raid2 /var
> > sd1e raid3 /usr
> > 
> > installboot was used on sd1c. During reboot, I told the alpha to
> > boot from dkc200, where sd1 lived. It booted up fine with the
> > regular warnings regarding the fake component.
> > 
> > Then I disklabelled sd0 with an exact duplicate of sd1's
> > disklabel, with only the "disk:" field different. installboot
> > on sd0c went fine.
> 
> Ummm... you said sd0 had live data on it... your 'new' disklabel
> didn't mash any partitions that you wanted to keep, did it???

At this point raid0 through raid3 held the same data as sd0a, sd0b,
sd0d, and sd0e respectively.

> > Then I tried to:
> > 
> > raidctl -a /dev/sd0a raid0 
> > raidctl -vF component0 raid0
> > ... all hell broke ...
> 
> In particular, sd0a was a 'free' partition on sd0 that you weren't
> planning to boot from ever again?

Right, because all that data was copied to the raid set (sd1 and the
fake component), which the system is running from at this point.

> > I heard a louder-than-usual 2-second grinding noise from the HDs
> > in the alpha. Then errors scrolled so fast I couldn't read them.
> > But I did see:
> > 
> > Multiple disks failed in a single group!  Aborting I/O operation.
> 
> At this point RAIDframe thinks that at least 2 disks are dead in
> one of the RAID sets.  That is usually due to hardware failure.
> (But if you booted from sd0a and then changed the disklabel for
> sd0 and then tried using sd0a as a component, I'm not sure how
> much the kernel will like that...)

When I changed the label on sd0, the system was running from the raid
set. I was only trying to add slices of sd0 as spares for the raid
set, to replace the fake components.

Before I did anything with sd0, I tested the system as it ran on the
raid set. Everything ran fine, so I was confident enough to proceed
with using sd0 as a spare to replace the fake component.
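For reference, the sequence I was attempting looked roughly like this
(device and component names are the ones from this thread; the flags
are as documented in raidctl(8) on NetBSD 1.6 -- treat this as a
sketch, not a transcript):

    # add sd0a as a hot spare for raid0
    raidctl -a /dev/sd0a raid0
    # fail the fake component and reconstruct onto the spare
    raidctl -vF component0 raid0
    # check the status of the set afterwards
    raidctl -s raid0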

Unless I'm the most unlucky guy in the world, the odds of a hardware
failure occurring while RAIDframe was reconstructing would be pretty
slim.

The disklabel for sd0 was changed only by replacing 4.2BSD and swap
with RAID; I kept the same layout as the old one. Interestingly
enough, when I booted from the 1.6 install floppies, I could mount
sd0d and sd0e and see all the old data on them, while sd0a gave
errors -- which makes sense, since I was using sd0a as a spare for
raid0.

In my other email I mentioned the warning about truncating sd0a. Why
would it need to truncate anything if sd0a is exactly the size of
raid0? That error got me concerned. In other words, the layout of
sd0 and sd1 is exactly the same.
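To illustrate, the label edit amounted to switching only the fstype
column (the sizes and offsets below are made up for illustration;
only the 4.2BSD/swap -> RAID change is the point):

    #        size    offset    fstype
    a:    1024000         0      RAID    # was 4.2BSD (/)
    b:     524288   1024000      RAID    # was swap
    d:    2097152   1548288      RAID    # was 4.2BSD (/var)
    e:    4194304   3645440      RAID    # was 4.2BSD (/usr)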

> > The alpha rebooted itself. I told it to use dkc200 (sd1) to boot
> > from. The 1st and 2nd boot stages went fine. The kernel was
> > printing its dmesg output: it showed the SCSI drives sd0 and sd1,
> > then it showed the line:
> > 
> > Kernelized RAIDFrame activated
> > ... bunch  of fast scrolling errors that I couldn't see ...
> > 
> > I was able to ctrl-c out to >>> and toward the top it said it
> > was not able to find init, and it tried to find init.bak but
> > couldn't.
> 
> In another email you mention:
> > raid0: RAID Level 1
> > raid0: Components: /dev/sd0a /dev/sd1a...[screen cutoff]
> 
> It'd be *really* nice to know whether or not it said
>  
>   /dev/sd1a[**FAILED**]
 
Time for that high tech camcorder to come out again.

Yes, it did say /dev/sd1a[**FAILED**].

> because that would mean that it's trying to use non-existent data
> bits from sd0a instead of sd1a.  That would also likely mean that
> sd1a had a read error, and ended up getting marked as failed.
> 
> > raid0: Total Sectors: 1023872 (499 MB)...
> ...
> > root on raid0a dumps on raid0b 
> >  NOTE: no such raid0b
> 
> > My questions:
> > 
> > 1. is my data gone?
> 
> Dunno.  I'm confused as to where the "real data" is and what parts
> are gone... (i.e. there was data on sd0 that might be mashed, you
> were booting w/ root on raid0a, the filesystem on raid0a appears
> to be corrupted, and we're not sure of the hardware state of sd1
> right now).  So there are at least 3 different places from which
> data may be "gone".
> 
> > 2. anyway to get the raid to boot again? ie. fix this problem?
> 
> You *might* be able to get it to light again by removing sd0 from
> the system and booting from sd1.
> 
> > 3. is this a nasty bug?
> 
> It sounds more like a read error on sd1a.  (A write error on sd0a
> shouldn't have caused the "Multiple disks...." error.)  The
> "louder than usual 2 second grinding noise from the HDs" points to
> a hardware problem.
> 
> > 4. anyway to pause the screen from scrolling?
> 
> Not that I know of.
> 
> Later...
> 
> Greg Oster

