Subject: Re: RAIDframe crash again
To: Greg Oster <oster@cs.usask.ca>
From: Kazushi (Jam) Marukawa <jam@pobox.com>
List: current-users
Date: 07/12/2001 22:48:52
   On Jul 12, 20:01, Greg Oster wrote:
   > Subject: Re: RAIDframe crash again
   > Kazushi Marukawa writes:
   > > The real reason is the failure of two hard drives in a 4-drive
   > > RAID5 system.  Then the system crashed.  Is there any way
   > > to stop this crash? 
   > 
   > No.  (I had a look the other day at trying to make it not panic on a
   > 2-component failure, but didn't get very far :( )

I see.  Thank you for telling me.  I understand that this is
how RAIDframe behaves.  Knowing that is better than nothing.

It's a little bad for me, since the current FT100 IDE driver is
not stable under some "going wrong" conditions.  It's sensitive
and produces unrecoverable errors once things start going
bad.  I like this driver and am grateful to its author,
though.  Maybe I should put a syslog checker into my
crontab so that I notice those wd error messages as soon as
possible.
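
A minimal sketch of such a checker, in Python.  Everything specific in
it is an assumption: the log path must be given on the command line,
and the pattern guesses that the kernel lines of interest begin with a
device name such as "wd3e:" and contain the word "error".  The real
messages may need a different pattern.

```python
import re
import sys

# Assumption: disk messages look like "wd3e: ... error ..."; adjust the
# pattern after looking at the actual lines in the log.
WD_ERROR = re.compile(r"\bwd\d+[a-z]?:.*\berror\b", re.IGNORECASE)


def find_wd_errors(lines):
    """Return the log lines that look like wd error messages."""
    return [line.rstrip("\n") for line in lines if WD_ERROR.search(line)]


def main(path):
    # errors="replace" keeps the scan going if the log has odd bytes.
    with open(path, errors="replace") as log:
        hits = find_wd_errors(log)
    if hits:
        # Print only when something matched, so a clean log stays silent.
        print("\n".join(hits))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Run from cron with the syslog file as its argument, it prints nothing
when the log is clean, and cron normally mails any output to the owner
of the job.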

   > > A copy of the messages is below.  This is
   > > not everything; I just grepped for the "raid" keyword.
   > [snip]
   > The more interesting bits will be from *before* the first "raid0: IO Error."
   > In particular, you need to find out *why* it said wd3e and wd1e failed.

There are tons of DMA error and correction messages for
both.  I should have noticed them, but I didn't.  Then the system
crashed.  For anyone who is interested, see below.
It contains all hard-disk-related messages starting from
one day before the day of the crash.

http://www.io.com/~kazushi/message-before-crash

   > > Both hard drives that raid marked as failed pass the
   > > manufacturer's test program.  Maybe they are going bad,
   > > but they work for now.
   > 
   > Could be cabling/heat/power issues too.  How long have you been
   > running this  RAID set?  

I've been using this machine for several months in the same
configuration.  It had been up since June 10 when
it crashed.  I think this was caused by heat, since I
don't have an air conditioner here and it's hot these days.
I'm using HDD coolers to keep the drives as cool as possible,
though.

On the other hand, I remember seeing the same pattern
when one of my hard drives was wearing out.  It showed
DMA errors occasionally, then frequently, and finally died.
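
That "occasionally, then frequently" pattern could be made visible by
counting error lines per day and per drive.  A rough Python sketch,
assuming traditional syslog lines that start with "Mon DD" and the
same guessed "wdN...: ... error" message shape (both are assumptions,
not taken from the real log):

```python
import re
from collections import Counter

# Assumption: the drive name precedes a colon, e.g. "wd3e:" -> drive "wd3".
WD_ERROR = re.compile(r"\b(wd\d+)[a-z]?:.*\berror\b", re.IGNORECASE)


def errors_per_day(lines):
    """Count matching lines per (day, drive), e.g. ("Jul 11", "wd3")."""
    counts = Counter()
    for line in lines:
        match = WD_ERROR.search(line)
        if match:
            # The first two fields of a traditional syslog line are the date.
            day = " ".join(line.split()[:2])
            counts[(day, match.group(1).lower())] += 1
    return counts
```

A count that grows from day to day for one drive would be the warning
sign to act on before a second component drops out.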

   > > I could copy those files
   > > into the original place.  Here is a trace after this crash.
   > 
   > Hmmm... Did you get a copy of the panic message?  Hard to tell exactly 
   > why it died here...  

There was no panic message as far as I remember.  It crashed
silently.

   > Could you ship me (privately) a copy of your raid config files and of
   > /var/run/dmesg.boot?  Thanks.

Yes.  I'll send them later.  Thanks.

Regards,
-- Kazushi
Eggheads unite!  You have nothing to lose but your yolks.
		-- Adlai Stevenson