Subject: Re: raidframe problems (revisited)
To: Louis Guillaume <firstname.lastname@example.org>
From: Greg Oster <email@example.com>
Date: 05/28/2007 16:50:03
Louis Guillaume writes:
> Greg Troxel wrote:
> > The other hypothesis is that raidframe is buggy
> I'm starting to believe that there is something funky going on inside of
> raidframe itself. I've seen this behaviour on different hardware,
> different disks, different memory, different power supplies. Perhaps there
> is a way to identify what types of hardware have these problems.
In the past 8.5 years, 99% of "raidframe bugs" have been hardware
issues or "something other than RAIDframe". I don't know how many
hours I've hunted for RAIDframe problems that weren't really there :-}
That said, if this is a RAIDframe issue, I'm more than happy to help
track it down and fix it...
> I've been using my SATA drives for a month or so now with one side of
> the RAID1 failed like this...
> /dev/wd0a: optimal
> /dev/wd1a: failed
> No spares.
> ... and it is flawless.
> I am extremely confident and I promise you that if I reconstruct this
> array, I'll see the corruption.
> Another interesting manifestation of the corruption was when my wife
> started using Electric Sheep as her screen saver. Her home directory is
> on a NFS-exported filesystem on this same array (raid1).
> Before the last reconstruct I backed everything up. After
> reconstruction, each new sheep her machine downloaded showed strange
> artifacts and some had a kind of "scrambled" look. But everything
> worked, strangely enough.
> After failing wd1a and restoring from backup, all of her sheep work
> I've had this same problem on a Pentium Pro, Pentium III and Athlon
> systems. Have swapped drives, cables, memory, power supplies. The
> constant is the way the system is used: a file server, sharing home
> directories and other stuff over NFS and netatalk to NetBSD, Linux and
> Mac systems.
> The next thing I will do is attempt to reproduce the problem on a
> completely different machine that hasn't been involved in any of this
> and see what happens.
> Any other ideas around where I can go from here would be great. Also if
> anyone is interested in trying to reproduce the problem it would
> certainly rule ME out as the problem :)
With the array in degraded mode, can you mount /dev/wd1a (or
equivalent) as a filesystem, and run a series of stress-tests on
that, at the same time that you stress the RAID set? Something like:
  foreach i (`jot 1000`)
    cp src.tar.gz src.tar.gz.$i && rm -f src.tar.gz.$i &
    dd if=/dev/zero of=bigfile.$i bs=10m count=100 && rm -f bigfile.$i &
    dd if=src.tar.gz.$i of=/dev/null bs=10m &
  end
that end up running on both wd0a and wd1a at the same time. In an
ideal world, take RAIDframe out of the equation entirely, and push
the disks, both reads and writes... (If you have an area reserved for
swap on both, you could disable swap, and use that space). And then
once the disks are "busy", do something like extract src.tar.gz to
both wd0a and wd1a, and compare the bits as extracted and see if
there are differences. (You'll need to tune things so you don't run
out of space, of course.)
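
The extract-and-compare step could be sketched roughly as below. This is
only an illustration in POSIX sh (not csh); the function name
compare_extract and the mount points in the usage line are made up for
the example, and it assumes a tar that accepts -C.

```sh
#!/bin/sh
# Sketch: extract one tarball onto two filesystems and compare the results.
# Usage: compare_extract TARBALL DIR_A DIR_B
# Returns 0 if the two extracted trees are identical, non-zero otherwise.
# (compare_extract is a hypothetical helper name, not an existing tool.)
compare_extract() {
    tarball=$1
    dir_a=$2
    dir_b=$3

    mkdir -p "$dir_a/extract" "$dir_b/extract" || return 2

    # Extract to both targets at once so both disks are busy together.
    tar -xzf "$tarball" -C "$dir_a/extract" &
    pid_a=$!
    tar -xzf "$tarball" -C "$dir_b/extract" &
    pid_b=$!

    # Give up if either extraction itself failed.
    wait "$pid_a" || return 2
    wait "$pid_b" || return 2

    # diff -r compares file contents across the two trees and reports
    # every file that differs or exists on only one side.
    diff -r "$dir_a/extract" "$dir_b/extract"
}
```

With wd0a and wd1a mounted on (hypothetical) /mnt/wd0a and /mnt/wd1a,
that would be run as "compare_extract src.tar.gz /mnt/wd0a /mnt/wd1a"
while the stress loop is still going.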
I suspect it's a drive controller issue (or driver issue) that only
manifests itself when you push both channels really hard...
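
If it helps, here is a rough sketch of a load test that also checks the
data afterwards: it copies one reference file to several directories in
parallel and then compares every copy against the original. It is
written in POSIX sh, stress_verify is a made-up name, and the directories
are assumed to be mount points on the disks under test.

```sh
#!/bin/sh
# Sketch: copy a reference file to several directories concurrently,
# then verify every copy byte for byte against the reference.
# Usage: stress_verify REF COUNT DIR...
# Returns 0 if all copies match, non-zero if any copy differs or is missing.
# (stress_verify is a hypothetical helper name, not an existing tool.)
stress_verify() {
    ref=$1
    count=$2
    shift 2

    # Start COUNT background copies per directory so all disks are
    # written at the same time.
    i=1
    while [ "$i" -le "$count" ]; do
        for dir in "$@"; do
            cp "$ref" "$dir/copy.$i" &
        done
        i=$((i + 1))
    done
    wait    # let every background copy finish

    bad=0
    for dir in "$@"; do
        i=1
        while [ "$i" -le "$count" ]; do
            if cmp -s "$ref" "$dir/copy.$i"; then
                rm -f "$dir/copy.$i"
            else
                # Keep mismatching copies around for later inspection.
                echo "MISMATCH: $dir/copy.$i"
                bad=$((bad + 1))
            fi
            i=$((i + 1))
        done
    done
    [ "$bad" -eq 0 ]
}
```

Something like "stress_verify src.tar.gz 20 /mnt/wd0a /mnt/wd1a" (paths
hypothetical) would exercise both disks. Note that the verify pass may
be served from the buffer cache rather than the disks, so the total
data probably needs to exceed RAM, or the filesystems need to be
remounted before comparing.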