Subject: Re: Horrible RAIDFrame Crash
To: Caffeinate The World <mochaexpress@yahoo.com>
From: Greg Oster <oster@cs.usask.ca>
List: current-users
Date: 04/15/2003 17:50:07
Caffeinate The World writes:
> 
> --- Caffeinate The World <mochaexpress@yahoo.com> wrote:
> > 
> > --- Caffeinate The World <mochaexpress@yahoo.com> wrote:
> > > 
> > > --- Caffeinate The World <mochaexpress@yahoo.com> wrote:
> > > I unplugged the SCSI connector from sd0 and booted the system up
> > > again.
> > > It booted up fine with the failed component errors. So sd1 is fine.
> > 
> > > 
> > > What can I do to further narrow down the problem? Apparently it's
> > > sd0, and it could be the write process that caused the "Multiple
> > > disks" error. I get the feeling that if I repeat building sd0 as
> > > the spare, I'll get the same errors.
> > 
> > I unplugged the SCSI cable from sd0 and booted up the system. It
> > booted up fine. I shut down to single-user mode, plugged the SCSI
> > cable back into sd0, and ran "scsictl scsibus0 scan any any". It
> > found sd0 fine.
> > 
> > Tried to get sd0a to hotspare with raid0 again.
> > 
> > raidctl -a /dev/sd0a raid0
> > warning: truncating spare disk /dev/sd0a to 1023872 blocks
> > 
> > NOTE: sd0a has the same layout and size as sd1a used by raid0. So
> > that truncation warning doesn't make sense.
> 
> I checked the disklabel of sd0c and it showed partition a: having
> 1024000 sectors with offset 0. sd1a is exactly the same. raid0 (which
> is a raid1 composed of sd1a and sd0a) has 1023872 sectors with offset
> 0. Note that 1023872 was given by disklabel raid0 > disklabel.raid0
> before. 1024000 - 1023872 is 128, which happens to be raid0's sectors
> per track.
> 
> Why is it using 1023872? Is the 128 reserved for the raid disk label?

64 blocks are reserved for the RAIDframe component label.  The number of
blocks actually used is then rounded down to a multiple of 128 (which
1023872 is).
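
As a quick check of that arithmetic, here is a minimal sketch. It assumes
the 64 reserved blocks are subtracted first and the remainder is then
rounded down to a whole multiple of 128. The function and constant names
are made up for illustration and are not taken from the RAIDframe source.

```python
# Hypothetical names for illustration; not RAIDframe's own identifiers.
LABEL_RESERVED_BLOCKS = 64   # blocks reserved for the component label
BLOCK_MULTIPLE = 128         # usable size must be a multiple of this


def usable_blocks(partition_blocks: int) -> int:
    """Subtract the reserved blocks, then round down to a multiple of 128.

    The order of the two steps is an assumption made for this sketch.
    """
    available = partition_blocks - LABEL_RESERVED_BLOCKS
    return (available // BLOCK_MULTIPLE) * BLOCK_MULTIPLE


print(usable_blocks(1024000))  # 1023872
```

For the 1024000-sector partition in question this gives 1023872, which
matches the size in the truncation warning.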

> > raidctl -vF component0 raid0
> > started doing the reconstruction and was at 2% when
> > ...fast scrolling errors... then
> > 
> > recon read failed
> > panic: raidframe error at line 1314 file
> > /usr/src/sys/dev/raidframe/rf_reconstruct.c
> > syncing disks... Multiple disks failed in a single group! Aborting
> > I/O
> > operation
> > 
> > Multiple disks failed...operation [repeated 17 times]
> > 
> > panic raidframe error at line 471 file
> > /usr/src/sys/dev/raidframe/rf_states.c
> 
> I was curious whether sd0a had read/write problems, so I changed sd0a
> from type RAID to 4.2BSD. I ran newfs on sd0a and it went fine, and
> fsck on sd0a reported no errors. After that successful test, I changed
> sd0a's type back to RAID in the disklabel (everything else remained
> the same).

newfs doesn't touch every block, and so isn't going to catch all media errors.
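
One way to exercise every block is a full sequential read of the
partition. The sketch below is only an illustration: the function name,
the chunk size, the error cap, and the idea of pointing it at a raw device
such as /dev/rsd0a are all assumptions. A read-only pass will also not
expose problems that only show up on writes.

```python
import os
import sys


def scan_device(path, chunk_size=64 * 1024, max_errors=100):
    """Read `path` start to end; return (bytes_read, failed_offsets).

    Hypothetical helper: chunk_size and max_errors are arbitrary choices.
    """
    bad_offsets = []
    total = 0
    offset = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while len(bad_offsets) < max_errors:
            try:
                data = os.pread(fd, chunk_size, offset)
            except OSError:
                # Note the chunk that could not be read, then skip past it.
                bad_offsets.append(offset)
                offset += chunk_size
                continue
            if not data:
                break  # end of device or file
            total += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return total, bad_offsets


if __name__ == "__main__" and len(sys.argv) == 2:
    total, bad = scan_device(sys.argv[1])
    print(f"read {total} bytes, {len(bad)} unreadable chunks")
```

Run against the suspect partition, any offsets it reports would point at
regions that newfs may never have touched.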

> Trying to get sd0a to be spare for raid0 again:
> 
> raidctl -a /dev/sd0a raid0
> ...truncation warning...
> raidctl -vF component0 raid0
> ...reconstruct at around 7%...
> ...two quick semi-loud ZZZzzz ZZZzzz sound from the HD...
> ...crash...reboot...

The crash part arguably shouldn't happen, but RAIDframe doesn't deal very well
with degraded sets that have component failures.

> On a positive note, I finally got a chance to take the alpha down and
> replace the CMOS battery so it would keep proper time and CMOS settings
> across long (like 2 min) periods of being shut off. It also means not
> having to get into AlphaBIOS for NT each time the system power cycles.
> Before, you had to go into the CMOS setup and set OpenVMS again, then
> reboot just to get it to boot. A CR2032 lithium battery costs $2.99 at
> Kmart and RadioShack.
> 
> Thomas

Later...

Greg Oster