current-users: Re: RAIDframe crash

Subject: Re: RAIDframe crash
To: Greg Oster <oster@cs.usask.ca>
From: Chris Jones <chris@cjones.org>
List: current-users
Date: 05/08/2001 17:41:27

Thanks for the verbose help, Greg.

On Tue, May 08, 2001 at 05:20:58PM -0600, Greg Oster wrote:

> The date jumps here... is there data missing?  If so, at least some of it is
> critical to solving this...

Yeah; sorry for my incomplete report.  As you guessed later on, sd4
wasn't in the system at all.  I was basically testing the whole thing
to see how well it's going to work, before I put it into production.
So I've got two out of my three disks online, and I'm thrashing the
filesystems.

I guess I was hoping that, while the RAID array was operating in
degraded mode, it would fail analogously to a single-disk filesystem:
Reboot, fsck (and possibly lose some un-synced data), and keep going.
In fact, it looks more like: Reboot, make the sysadmin force a RAID
configure, then fsck (and possibly lose data).

> > May  8 16:10:46 gamera /netbsd: sd2(siop1:0:0): command timeout
>
> "Uh oh.."  This will not make RAIDframe happy if the IO didn't complete... but
> no big deal... RAID 5 will deal with a single disk failure...  I'm guessing
> that sd2e, sd3e, and sd4e are all in this RAID 5 set... but where is sd4?
> If it's not there, then you've already had 2 failures, which is fatal in a
> RAID 5 set...

Yeah.  I don't know the cause of the underlying error; I'll have to
investigate that.  This machine has been giving me a lot of trouble,
though, with various SCSI controllers, cables, drives, and
enclosures.  Sometimes I wonder if SCSI just doesn't like me...

> > May  8 16:10:47 gamera /netbsd: sd3(siop1:1:0): parity error
>
> "Uh Oh#2".  If sd3 is in the RAID set, RAIDframe is going to be really upset,
> as with 2 (or is it 3 now?) disks gone, it's pretty much game over.
> (And RAIDframe will turn out the lights, rather than continuing on...)
> (It should probably just stop doing IO's to the RAID set, but that's a
> different problem).

Aha.  So you're saying that all RAID sets will fail (or more
accurately, RAIDframe will fail) in the event of a double disk
failure?  That's fine, really; it's just something I wasn't aware of.
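
For anyone reading along in the archive, here is why a second failure is
fatal for RAID 5 in particular. This is a toy sketch of XOR parity over one
stripe, not RAIDframe's actual code; the names xor_blocks and reconstruct are
made up for illustration, and the three-component stripe (two data blocks plus
parity) is an assumption modelled on the sd2e/sd3e/sd4e set discussed here.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(stripe):
    """Return the stripe with its single missing block (None) rebuilt.

    With XOR parity, any one block equals the XOR of all the others,
    so one missing component can be recomputed. Two missing components
    leave one equation with two unknowns, so we give up.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("more than one component missing; stripe is unrecoverable")
    if not missing:
        return list(stripe)
    survivors = [block for block in stripe if block is not None]
    rebuilt = list(stripe)
    rebuilt[missing[0]] = xor_blocks(survivors)
    return rebuilt


# Hypothetical stripe: two data blocks and their parity block.
d0 = b"\x01\x02\x03\x04"
d1 = b"\x10\x20\x30\x40"
parity = xor_blocks([d0, d1])

# Degraded mode: one component gone, contents still recoverable.
recovered = reconstruct([d0, None, parity])
```

Calling reconstruct([None, None, parity]) raises ValueError, which is the toy
analogue of the array having nothing left to rebuild from once a second
component fails.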

> > May  8 16:10:47 gamera /netbsd: siop1: scsi bus reset
> > May  8 16:10:47 gamera /netbsd: cmd 0xc06700c0 (target 0:0) in reset list
> >
> > ...and then it crashed.  The console had some message about RAIDframe
> > being unable to allocate a DAG.  I didn't write it down or get a
> > backtrace, because I knew it would make a core dump.  :-/
>
> Writing it down would not have caused a core dump, and would have helped
> confirm what I suspect happened.  Basically when 2 disks in a RAID 5 set
> fail, RAIDframe gives up.  And by the looks of it, the machine had some
> serious SCSI problems, errors were returned to RAIDframe, RAIDframe marked the
> components as failed, and when more than one component failed, RAIDframe said
> "enough".

:)  I didn't mean to say that I thought writing down a backtrace would
cause a core dump.  I meant to say that I didn't bother, because I
assumed that this crash would trigger a core dump, regardless.  Which
it did.  I just wasn't able to grab the core dump off the dump device
when it came back up.

> > Problem 2:  I'd like to get raid1 back up again, but it won't
> > configure:
> > May  8 16:38:31 gamera /netbsd: raidlookup on device: /dev/sd4e failed!
> > May  8 16:38:31 gamera /netbsd: Hosed component: /dev/sd4e
>
> So sd4 is no longer on the system?  Oh... looks like it wasn't there
> before on May 7???  (Unless the logs you have here are incomplete...)

I included all information relevant to scsibuses and my fxp0.  If you
want more, I can certainly send it.  :)  Though it sounds like that's
all irrelevant to what's going on.

> [...]  If,
> however, sd4 was *not* in the system and you were running in
> degraded mode from the get-go, then you can just use sd2 and sd3,
> and things should be reasonably ok.  (there will likely be some
> filesystem lossage, but hopefully not much.)  Once we know whether
> sd4 was there or not we'll have a better idea of what components you
> want to forcibly configure...

Yeah; sd2 and sd3.  I've done that, and I'm in the process of running
fsck now.  For what it's worth, given the SCSI errors.  Sigh.

Chris

-- 
---------------------------------------------------- chris@cjones.org
Chris Jones                                          Mad scientist at large
  www.netbsd.org www.postgresql.org www.schemers.org www.python.org
