current-users: Re: Horrible RAIDFrame Crash

Subject: Re: Horrible RAIDFrame Crash
To: Caffeinate The World <mochaexpress@yahoo.com>
From: Greg Oster <oster@cs.usask.ca>
List: current-users
Date: 04/15/2003 07:48:03

Caffeinate The World writes:
> Alpha 1.6R CVS src from 4/13.
> 
> I was trying to build a raid1 set using sd0 and sd1. sd0 had live
> data on it. I succeeded in creating a temporary set using sd1 and a
> fake nonexisting first component. 
> 
> sd1a raid0 /
> sd1b raid1 swap
> sd1d raid2 /var
> sd1e raid3 /usr
> 
> installboot was used on sd1c. During reboot, I told the alpha to boot
> from dkc200 where sd1 lived. It booted up fine with the regular
> warnings regarding the fake component.
> 
> Then I disklabel sd0 with an exact duplicate of sd1 disklabel with only
>  the "disk:" field in the disklabel different. installboot on sd0c went
> fine.

Ummm... you said sd0 had live data on it... your 'new' disklabel didn't 
mash any partitions that you wanted to keep, did it??? 

> Then I tried to:
> 
> raidctl -a /dev/sd0a raid0 
> raidctl -vF component0 raid0
> ... all hell broke ...

In particular, sd0a was a 'free' partition on sd0 that you weren't planning to 
boot from ever again?
 
> I heard a louder than usual 2 second grinding noise from the HDs in the
> alpha. Then errors scrolled so fast I couldn't see. But I did see:
> 
> Multiple disks failed in a single group!  Aborting I/O operation.

At this point RAIDframe thinks that at least 2 disks are dead in one of the 
RAID sets.  That is usually due to hardware failure.  (But if you booted 
from sd0a and then changed the disklabel for sd0 and then tried using
sd0a as a component, I'm not sure how much the kernel will like that...)
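
To make the "Multiple disks failed" condition concrete, here is a toy model of
the rule described above. It is only a sketch, not RAIDframe code: the function
and exception names are made up for illustration, and it assumes a two-way
mirror that can survive exactly one failed component.

```python
# Toy model only: this is NOT RAIDframe code, just the rule described above.
# check_mirror and MultipleFailures are hypothetical names.

class MultipleFailures(Exception):
    """Raised when a mirror has no healthy component left to read from."""


def check_mirror(components):
    """components maps a component name to True (healthy) or False (failed)."""
    failed = sorted(name for name, ok in components.items() if not ok)
    if len(failed) >= 2:
        raise MultipleFailures(
            "Multiple disks failed in a single group: " + ", ".join(failed)
        )
    return failed


# The temporary set (fake first component plus sd1a) is degraded but usable.
degraded = check_mirror({"component0": False, "/dev/sd1a": True})

# If sd1a is then marked failed as well, there is nothing left to read from.
try:
    check_mirror({"component0": False, "/dev/sd1a": False})
    aborted = False
except MultipleFailures:
    aborted = True
```

In this model the set created with the fake component is already one failure
away from aborting, so a single read error on sd1a would be enough.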
 
> The alpha rebooted itself. I told it to use dkc200 (sd1) to boot from.
> 1st and 2nd boot stage went fine. kernel was showing dmesg.. it showed
> the SCSI drives sd0 and sd1.. then it showed the line:
> 
> Kernelized RAIDFrame activated
> ... bunch  of fast scrolling errors that I couldn't see ...
> 
> I was able to ctrl-c out to >>> and toward the top it said not able to
> find init.. and it tried to find init.bak but couldn't.

In another email you mention:
> raid0: RAID Level 1
> raid0: Components: /dev/sd0a /dev/sd1a...[screen cutoff]

It'd be *really* nice to know whether or not it said
 
  /dev/sd1a[**FAILED**]

because that would mean that it's trying to use non-existent data bits from 
sd0a instead of sd1a.  That would also likely mean that sd1a had a read error, 
and ended up getting marked as failed.
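
If the boot messages can be captured (serial console log, say), a few lines of
Python would be enough to pick out which components were flagged. This is a
sketch that assumes the line looks like the fragment quoted above, with the
[**FAILED**] marker directly after the device name; failed_components is a
made-up helper name, and the sample line is hypothetical.

```python
import re


def failed_components(line):
    """Return the device names that carry a [**FAILED**] marker in one line."""
    # Assumed format: "/dev/sd1a[**FAILED**]" with no space before the marker.
    return re.findall(r"(/dev/\S+?)\[\*\*FAILED\*\*\]", line)


# Hypothetical sample line, modeled on the truncated one from the other email.
sample = "raid0: Components: /dev/sd0a /dev/sd1a[**FAILED**]"
flagged = failed_components(sample)
```

An empty result for the raid0 line would mean neither component was marked
failed at configuration time.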

> raid0: Total Sectors: 1023872 (499 MB)...
...
> root on raid0a dumps on raid0b 
>  NOTE: no such raid0b


 
> My questions:
> 
> 1. is my data gone?

Dunno.  I'm confused as to where the "real data" is and what parts 
are gone... (i.e. there was data on sd0 that might be mashed, you were booting
w/ root on raid0a, and the filesystem on raid0a appears to be corrupted, 
and we're not sure of the hardware state of sd1 right now).  So there are
at least 3 different places from where data may be "gone".

> 2. anyway to get the raid to boot again? ie. fix this problem?

You *might* be able to get it to light again by removing sd0 from the system
and booting from sd1.

> 3. is this a nasty bug?

It sounds more like a read error on sd1a.  (A write error on sd0a shouldn't 
have caused the "Multiple disks...." error.)  The "louder than usual 2 second 
grinding noise from the HDs" points to a hardware problem.  

> 4. anyway to pause the screen from scrolling?

Not that I know of.

Later...

Greg Oster