Subject: Re: RAIDframe crash
To: Greg Oster <oster@cs.usask.ca>
From: Chris Jones <chris@cjones.org>
List: current-users
Date: 05/08/2001 17:41:27
Thanks for the verbose help, Greg.
On Tue, May 08, 2001 at 05:20:58PM -0600, Greg Oster wrote:
> The date jumps here... is there data missing? If so, at least some of it is
> critical to solving this...
Yeah; sorry for my incomplete report. As you guessed later on, sd4
wasn't in the system at all. I was basically testing the whole thing
to see how well it's going to work, before I put it into production.
So I've got two out of my three disks online, and I'm thrashing the
filesystems.
I guess I was hoping that, while the RAID array was operating in
degraded mode, it would fail analogously to a single-disk filesystem:
Reboot, fsck (and possibly lose some un-synced data), and keep going.
In fact, it looks more like: Reboot, make the sysadmin force a RAID
configure, then fsck (and possibly lose data).
> > May 8 16:10:46 gamera /netbsd: sd2(siop1:0:0): command timeout
>
> "Uh oh.." This will not make RAIDframe happy if the IO didn't complete... but
> no big deal... RAID 5 will deal with a single disk failure... I'm guessing
> that sd2e, sd3e, and sd4e are all in this RAID 5 set... but where is sd4?
> If it's not there, then you've already had 2 failures, which is fatal in a
> RAID 5 set...
Yeah. I don't know the cause of the underlying error; I'll have to
investigate that. This machine has been giving me a lot of trouble,
though, with various SCSI controllers, cables, drives, and
enclosures. Sometimes I wonder if SCSI just doesn't like me...
> > May 8 16:10:47 gamera /netbsd: sd3(siop1:1:0): parity error
>
> "Uh Oh#2". If sd3 is in the RAID set, RAIDframe is going to be really upset,
> as with 2 (or is it 3 now?) disks gone, it's pretty much game over.
> (And RAIDframe will turn out the lights, rather than continuing on...)
> (It should probably just stop doing IO's to the RAID set, but that's a
> different problem).
Aha. So you're saying that all RAID sets will fail (or more
accurately, RAIDframe will fail) in the event of a double disk
failure? That's fine, really; it's just something I wasn't aware of.
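To make the single-versus-double failure point concrete, here is a minimal Python sketch of RAID 5 style XOR parity. It is only an illustration under stated assumptions, not RAIDframe's code: the helper names (xor_blocks, read_stripe) and the three-component stripe contents are hypothetical, and real arrays rotate parity across components and work on much larger stripe units.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_stripe(stripe):
    """Return the stripe with at most one missing block rebuilt.

    `stripe` holds one block per component (data blocks plus parity);
    None marks a component that has failed. Hypothetical helper.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        # With two unknowns and one parity equation there is nothing to solve.
        raise RuntimeError("two or more components failed; stripe is unrecoverable")
    if not missing:
        return list(stripe)
    survivors = [block for block in stripe if block is not None]
    rebuilt = list(stripe)
    # The missing block is the XOR of every surviving block.
    rebuilt[missing[0]] = xor_blocks(survivors)
    return rebuilt


# Hypothetical stripe on a three-component set: two data blocks plus parity.
d0 = b"\x01\x02\x03\x04"
d1 = b"\x10\x20\x30\x40"
parity = xor_blocks([d0, d1])

# One component gone (degraded mode): the lost block can still be rebuilt.
assert read_stripe([d0, None, parity])[1] == d1
```

Calling read_stripe([d0, None, None]) raises RuntimeError, which is the situation in these logs: with one component already absent, a second failure leaves nothing to rebuild from.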
> > May 8 16:10:47 gamera /netbsd: siop1: scsi bus reset
> > May 8 16:10:47 gamera /netbsd: cmd 0xc06700c0 (target 0:0) in reset list
> >
> > ...and then it crashed. The console had some message about RAIDframe
> > being unable to allocate a DAG. I didn't write it down or get a
> > backtrace, because I knew it would make a core dump. :-/
>
> Writing it down would not have caused a core dump, and would have helped
> confirmed what I suspect happened. Basically when 2 disks in a RAID 5 set
> fail, RAIDframe gives up. And by the looks of it, the machine had some
> serious SCSI problems, errors were returned to RAIDframe, RAIDframe marked the
> components as failed, and when more than one component failed, RAIDframe said
> "enough".
:) I didn't mean to say that I thought writing down a backtrace would
cause a core dump. I meant to say that I didn't bother, because I
assumed that this crash would trigger a core dump, regardless. Which
it did. I just wasn't able to grab the core dump off the dump device
when it came back up.
> > Problem 2: I'd like to get raid1 back up again, but it won't
> > configure:
> > May 8 16:38:31 gamera /netbsd: raidlookup on device: /dev/sd4e failed!
> > May 8 16:38:31 gamera /netbsd: Hosed component: /dev/sd4e
>
> So sd4 is no longer on the system? Oh... looks like it wasn't there
> before on May 7??? (Unless the logs you have here are incomplete...)
I included all information relevant to scsibuses and my fxp0. If you
want more, I can certainly send it. :) Though it sounds like that's
all irrelevant to what's going on.
> [...] If,
> however, sd4 was *not* in the system and you were running in
> degraded mode from the get-go, then you can just use sd2 and sd3,
> and things should be reasonably ok. (there will likely be some
> filesystem lossage, but hopefully not much.) Once we know whether
> sd4 was there or not we'll have a better idea of what components you
> want to forcibly configure...
Yeah; sd2 and sd3. I've done that, and I'm in the process of running
fsck now. For what it's worth, given the SCSI errors. Sigh.
Chris
--
---------------------------------------------------- chris@cjones.org
Chris Jones Mad scientist at large
www.netbsd.org www.postgresql.org www.schemers.org www.python.org