current-users: Re: Horrible RAIDFrame Crash

Subject: Re: Horrible RAIDFrame Crash
To: Caffeinate The World <mochaexpress@yahoo.com>
From: Greg Oster <oster@cs.usask.ca>
List: current-users
Date: 04/15/2003 13:35:38

Caffeinate The World writes:
> 
> --- Caffeinate The World <mochaexpress@yahoo.com> wrote:
> > 
> > --- Caffeinate The World <mochaexpress@yahoo.com> wrote:
> > I unplugged the SCSI connector from sd0 and booted the system up
> > again.
> > It booted up fine with the failed component errors. So sd1 is fine. 
> > 
> > What can I do to further narrow down the problem? Apparently it's sd0
> > and it could be during the write process that caused the Multiple
> > disks
> > error. I get the feeling that if I repeat building sd0 as the spare,
> > I'll get the same errors.
> 
> I unplugged the SCSI cable from sd0 and booted up the system. It booted
> up fine. I shut down to single-user mode, plugged the SCSI cable back
> into sd0, and ran "scsictl scsibus0 scan any any". It found sd0 fine.
> 
> Tried to get sd0a to hotspare with raid0 again.
> 
> raidctl -a /dev/sd0a raid0
> warning: truncating spare disk /dev/sd0a to 1023872 blocks
> 
> NOTE: sd0a has the same layout and size as sd1a used by raid0. So that
> truncating error doesn't make sense.

What happens is that RAIDframe 'truncates' the component down to a multiple of 
the stripe width.  So it probably truncated the component on sd1 as well. 
Not a problem.
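
To make the rounding concrete, here is a minimal sketch of "truncate down to
a multiple". The raw size of 1023900 blocks and the unit of 128 blocks are
made-up values for illustration only; they are not taken from your disklabel
or your RAID configuration, and this is not the RAIDframe code itself.

```sh
#!/bin/sh
# Round a block count down to the nearest multiple of a unit.
# Both arguments below are hypothetical example values.
round_down_to_multiple() {
    size=$1    # raw size in blocks (assumed value)
    unit=$2    # unit to align to, in blocks (assumed value)
    # Integer division drops the remainder; multiplying back gives
    # the largest multiple of "unit" that fits in "size".
    echo $(( (size / unit) * unit ))
}

round_down_to_multiple 1023900 128    # prints 1023872
```

With those assumed numbers the result happens to match the 1023872 blocks in
the warning, which is the sort of arithmetic that produces a figure like that.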

> raidctl -vF component0 raid0
> started doing the reconstruction and was at 2% when
> ...fast scrolling errors... then
> 
> recon read failed
> panic: raidframe error at line 1314 file
> /usr/src/sys/dev/raidframe/rf_reconstruct.c

This *is* an error in reading from a block on sd1.  You might try 
doing:

 dd if=/dev/rsd1d of=/dev/null bs=1m 

and see whether that errors out too.  You might also check to see if 
any of the logs in /var/log have a mention of a failing read.
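
For the log check, something along these lines would do. It is only a sketch:
find_disk_msgs is a helper name made up here, and it assumes the kernel
messages are being logged to /var/log/messages (check /etc/syslog.conf if
they go somewhere else on your system).

```sh
#!/bin/sh
# find_disk_msgs DISK FILE...
# Print every line in the given log files that mentions DISK as a
# whole word, so that "sd1" does not also match "sd10".
# The function name is hypothetical, not an existing tool.
find_disk_msgs() {
    disk=$1
    shift
    grep -hw -- "$disk" "$@" 2>/dev/null
}

# Example use (assumed log location):
#   find_disk_msgs sd1 /var/log/messages
```

Anything it prints that mentions a read error on sd1 would point the same
way as the failed reconstruction read.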

> syncing disks... Multiple disks failed in a single group! Aborting I/O
> operation

Yup.. as far as RAIDframe is concerned it can't do anything with that RAID set 
after another disk (the last and only disk, in this case) failed.

> Multiple disks failed...operation [repeated 17 times]
> 
> panic raidframe error at line 471 file
> /usr/src/sys/dev/raidframe/rf_states.c
> 
> P.S. I started the  NetBSD nightmare thread FFS2, I guess this is the
> sequel: NetBSD Nightmare II. :(

No... It's most likely "The Hardware Nightmare" :(

[I'll answer your other postings later this evening...]

Later...

Greg Oster