Subject: Re: RAIDframe crash
To: Greg Oster <>
From: Chris Jones <>
List: current-users
Date: 05/08/2001 17:41:27

Thanks for the verbose help, Greg.

On Tue, May 08, 2001 at 05:20:58PM -0600, Greg Oster wrote:

> The date jumps here... is there data missing?  If so, at least some of it is
> critical to solving this...

Yeah; sorry for my incomplete report.  As you guessed later on, sd4
wasn't in the system at all.  I was basically testing the whole thing
to see how well it's going to work, before I put it into production.
So I've got two out of my three disks online, and I'm thrashing the

I guess I was hoping that, while the RAID array was operating in
degraded mode, it would fail analogously to a single-disk filesystem:
Reboot, fsck (and possibly lose some un-synced data), and keep going.
In fact, it looks more like: Reboot, make the sysadmin force a RAID
configure, then fsck (and possibly lose data).

> > May  8 16:10:46 gamera /netbsd: sd2(siop1:0:0): command timeout
> "Uh oh.."  This will not make RAIDframe happy if the IO didn't complete... but
> no big deal... RAID 5 will deal with a single disk failure...  I'm guessing
> that sd2e, sd3e, and sd4e are all in this RAID 5 set... but where is sd4?
> If it's not there, then you've already had 2 failures, which is fatal in a
> RAID 5 set...

Yeah.  I don't know the cause of the underlying error; I'll have to
investigate that.  This machine has been giving me a lot of trouble,
though, with various SCSI controllers, cables, drives, and
enclosures.  Sometimes I wonder if SCSI just doesn't like me...

> > May  8 16:10:47 gamera /netbsd: sd3(siop1:1:0): parity error
> "Uh Oh#2".  If sd3 is in the RAID set, RAIDframe is going to be really upset,
> as with 2 (or is it 3 now?) disks gone, it's pretty much game over.
> (And RAIDframe will turn out the lights, rather than continuing on...)
> (It should probably just stop doing IO's to the RAID set, but that's a
> different problem).

Aha.  So you're saying that all RAID sets will fail (or more
accurately, RAIDframe will fail) in the event of a double disk
failure?  That's fine, really; it's just something I wasn't aware of.
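
To make the single-versus-double failure point concrete for RAID 5
specifically, here is a minimal sketch of XOR parity in Python.  It is
only an illustration of the general idea, not RAIDframe's code; the
function names (xor_blocks, make_stripe, recover) are made up for this
example, and a stripe is modelled as a plain list of byte blocks with
None standing in for a failed component.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block (hypothetical layout)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe):
    """Rebuild a stripe in which failed components are marked as None.

    One missing block is the XOR of all surviving blocks.  With two or
    more missing blocks there is not enough information left, so give up.
    """
    lost = [i for i, block in enumerate(stripe) if block is None]
    if not lost:
        return list(stripe)
    if len(lost) > 1:
        raise ValueError("more than one component lost; stripe is unrecoverable")
    survivors = [block for block in stripe if block is not None]
    rebuilt = list(stripe)
    rebuilt[lost[0]] = xor_blocks(survivors)
    return rebuilt


# Three components, as with sd2e, sd3e and sd4e: two data blocks plus parity.
stripe = make_stripe([b"abcd", b"wxyz"])
degraded = [stripe[0], None, stripe[2]]   # one component gone: recoverable
restored = recover(degraded)
```

With one entry set to None the stripe comes back intact; with two set
to None, recover() raises ValueError, which is the arithmetic behind
"2 failures is fatal" for a RAID 5 set.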

> > May  8 16:10:47 gamera /netbsd: siop1: scsi bus reset
> > May  8 16:10:47 gamera /netbsd: cmd 0xc06700c0 (target 0:0) in reset list
> >
> > ...and then it crashed.  The console had some message about RAIDframe
> > being unable to allocate a DAG.  I didn't write it down or get a
> > backtrace, because I knew it would make a core dump.  :-/
> Writing it down would not have caused a core dump, and would have helped
> confirm what I suspect happened.  Basically when 2 disks in a RAID 5 set
> fail, RAIDframe gives up.  And by the looks of it, the machine had some
> serious SCSI problems, errors were returned to RAIDframe, RAIDframe marked the
> components as failed, and when more than one component failed, RAIDframe said
> "enough".

:)  I didn't mean to say that I thought writing down a backtrace would
cause a core dump.  I meant to say that I didn't bother, because I
assumed that this crash would trigger a core dump, regardless.  Which
it did.  I just wasn't able to grab the core dump off the dump device
when it came back up.

> > Problem 2:  I'd like to get raid1 back up again, but it won't
> > configure:
> > May  8 16:38:31 gamera /netbsd: raidlookup on device: /dev/sd4e failed!
> > May  8 16:38:31 gamera /netbsd: Hosed component: /dev/sd4e
> So sd4 is no longer on the system?  Oh... looks like it wasn't there
> before on May 7???  (Unless the logs you have here are incomplete...)

I included all information relevant to scsibuses and my fxp0.  If you
want more, I can certainly send it.  :)  Though it sounds like that's
all irrelevant to what's going on.

> [...]  If,
> however, sd4 was *not* in the system and you were running in
> degraded mode from the get-go, then you can just use sd2 and sd3,
> and things should be reasonably ok.  (there will likely be some
> filesystem lossage, but hopefully not much.)  Once we know whether
> sd4 was there or not we'll have a better idea of what components you
> want to forcibly configure...

Yeah; sd2 and sd3.  I've done that, and I'm in the process of running
fsck now.  For what it's worth, given the SCSI errors.  Sigh.


Chris Jones                                          Mad scientist at large
