Subject: Re: raidframe re-mirroring (cont'd)
To: Louis Guillaume <lguillaume@berklee.edu>
From: Greg Oster <oster@cs.usask.ca>
List: current-users
Date: 08/13/2004 08:34:02
Louis Guillaume writes:
> Hi Everyone,
> 
> I posted a few weeks ago about a problem I had with a raid set, where 
> one disk was failed and I wanted to bring it back online. Here's what 
> happened...
> 
> . Booted into single-user
> 
> . Rebuilt all arrays on the pair of disks: raid0 raid1 raid2 raid3 raid4 
> - all raid-1. It's set up like this...
> 
> #############################
> raid0 raid1 raid2 raid3 raid4
> 
> wd0a  wd0e  wd0f  wd0g  wd0b
> wd1a  wd1e  wd1f  wd1g  wd1b
> 
> /     /usr  /var  /home swap
> #############################
> 
> . fsck-ed all filesystems. reboot
> 
> Immediately, I noticed apache2 and spamass-milter fail during startup 
> (recently built from pkgsrc and very reliable). Immediately!

How do they fail?  What do they do/not do? (i.e. what is the nature 
of the error?)

> This is 
> what caused me to believe the second disk was bad in the first place.
> 
> Now I believed that the disk was actually bad and not the kernel/raidframe.
> 
> . Rebooted back to single user.
> . Failed all wd1 raid components.
> . fsck (finds and fixes errors) and reboot again.
> 
> All is well! For a week and a half, not a hitch.
> 
> More reason to believe it's the disk.
> 
> . Replace suspect disk with another one, disklabeled raidctl -a ...etc.
> 
> . Incorporated new spare components into arrays.
> 
> . rebooted. raidctl -F ... , fsck , reboot.
> 
> SAME FAILURES as before!! Apache2 and spamass-milter are the first to 
> go. In the past I had not noticed these right away and kept running.
> 
> This is very strange. I'd really like to get my redundancy back. But 
> once again, I'm running on a set of single-component raid-1 arrays.
> 
> Here is some other information that may be useful...
> 
> Machine - i386
> Problem first noticed at NetBSD-2.0E GENERIC.MP kernel
> Still a problem at NetBSD-2.0G GENERIC.MP kernel
> 
> I'm guessing my disk is good. The machine runs great on one disk. Weeks 
> of uptime - even months without a peep. So I'm not thinking that there's 
> a memory problem as someone suggested earlier.
> 
> The only other thing I can think of is perhaps the ribbon cable from the 
> board to the disk. But if that was bad, wouldn't we have much more 
> obvious issues?
> 
> I don't know if this is a config problem, or something else. But there 
> definitely is a strange problem that's preventing me from mirroring 
> successfully.
> 
> Perhaps too many raid devices on one pair of disks?

No.

> Maybe problems with MP kernel and raidframe?

Not supposed to be.  I haven't seen anything here that would suggest 
that... 
 
> Any help would be great. Please let me know if I can provide more 
> information.

The apache/milter errors would be useful.  RAID config files and a 
'dmesg' output would also help.
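
Something along these lines would gather most of that in one go.  This 
is only a rough sketch: it assumes the sets are raid0 through raid4 as 
in your layout above, and that any config files live at /etc/raidN.conf 
(that location is a guess on my part; adjust it to wherever yours are).  
The RAIDCTL, RAID_SETS and CONF_DIR variables and the function name are 
just illustrative.

```shell
#!/bin/sh
# Sketch only: collect dmesg, per-set status, and config files.
# Assumptions: sets are raid0..raid4; config files (if any) are
# named $CONF_DIR/raidN.conf.  Override the variables as needed.
RAIDCTL=${RAIDCTL:-raidctl}
RAID_SETS=${RAID_SETS:-"raid0 raid1 raid2 raid3 raid4"}
CONF_DIR=${CONF_DIR:-/etc}

collect_raid_info() {
    echo "=== dmesg ==="
    dmesg || echo "(dmesg not available)"
    for r in $RAID_SETS; do
        echo "=== status of $r ==="
        # raidctl -s prints component and parity status for one set
        "$RAIDCTL" -s "$r" || echo "(could not get status of $r)"
        if [ -f "$CONF_DIR/$r.conf" ]; then
            echo "=== $CONF_DIR/$r.conf ==="
            cat "$CONF_DIR/$r.conf"
        fi
    done
}
```

Run as root, something like 'collect_raid_info > /tmp/raid-report.txt 2>&1' 
would leave everything in one file that can be attached to a reply.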

Have you tried isolating which of the RAID sets seems to be causing 
the problem?
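
One possible way to do that is to re-mirror a single set, run with it 
for a while, and only then move on to the next one.  A rough sketch, 
assuming the wd1 component of the set is the one currently marked 
failed and that rebuilding in place with 'raidctl -R' is acceptable 
for your setup (the function name and the RAIDCTL variable are just 
illustrative):

```shell
#!/bin/sh
# Sketch only: bring ONE mirror back, then test before doing the next.
# Assumes the named component is the failed half of the named set.
RAIDCTL=${RAIDCTL:-raidctl}

remirror_one() {
    set_name=$1     # e.g. raid3
    component=$2    # e.g. /dev/wd1g
    "$RAIDCTL" -s "$set_name" || return 1               # status before
    "$RAIDCTL" -R "$component" "$set_name" || return 1  # rebuild in place
    "$RAIDCTL" -S "$set_name" || return 1               # reconstruction progress
    "$RAIDCTL" -s "$set_name"                           # status after
}
```

For example, 'remirror_one raid3 /dev/wd1g' would start with /home, 
which is probably the least painful set to experiment on.  If apache 
and the milter still behave afterwards, repeat with the next set and 
note which one brings the failures back.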

Later...

Greg Oster