Subject: Re: raidframe problems (revisited)
To: Louis Guillaume <email@example.com>
From: Greg Oster <firstname.lastname@example.org>
Date: 05/28/2007 16:50:03
Louis Guillaume writes:
> Greg Troxel wrote:
> > The other hypothesis is that raidframe is buggy
> I'm starting to believe that there is something funky going on inside of
> raidframe itself. I've seen this behaviour on different hardware,
> different disks, different memory, different power supplies. Perhaps there
> is a way to identify what types of hardware have these problems.
In the past 8.5 years, 99% of "raidframe bugs" have been hardware
issues or "something other than RAIDframe". I don't know how many
hours I've hunted for RAIDframe problems that weren't really there :-}
That said, if this is a RAIDframe issue, I'm more than happy to help
track it down and fix it...
> I've been using my SATA drives for a month or so now with one side of
> the RAID1 failed like this...
> /dev/wd0a: optimal
> /dev/wd1a: failed
> No spares.
> ... and it is flawless.
> I am extremely confident and I promise you that if I reconstruct this
> array, I'll see the corruption.
> Another interesting manifestation of the corruption was when my wife
> started using Electric Sheep as her screen saver. Her home directory is
> on a NFS-exported filesystem on this same array (raid1).
> Before the last reconstruct I backed everything up. After
> reconstruction, each new sheep her machine downloaded showed strange
> artifacts and some had a kind of "scrambled" look. But everything
> worked, strangely enough.
> After failing wd1a and restoring from backup, all of her sheep work.
> I've had this same problem on Pentium Pro, Pentium III and Athlon
> systems. I have swapped drives, cables, memory and power supplies. The
> constant is the way the system is used: a file server, sharing home
> directories and other stuff over NFS and netatalk to NetBSD, Linux and
> Mac systems.
> The next thing I will do is attempt to reproduce the problem on a
> completely different machine that hasn't been involved in any of this
> and see what happens.
> Any other ideas around where I can go from here would be great. Also if
> anyone is interested in trying to reproduce the problem it would
> certainly rule ME out as the problem :)
With the array in degraded mode, can you mount /dev/wd1a (or
equivalent) as a filesystem, and run a series of stress-tests on
that, at the same time that you stress the RAID set? Something like:
foreach i (`jot 1000`)
  cp src.tar.gz src.tar.gz.$i && rm -f src.tar.gz.$i &
  dd if=/dev/zero of=bigfile.$i bs=10m count=100 && rm -f bigfile.$i &
  dd if=src.tar.gz of=/dev/null bs=10m &
end
so that reads and writes end up running on both wd0a and wd1a at the
same time. In an ideal world, take RAIDframe out of the equation
entirely and push the disks directly, with both reads and writes...
(If you have an area reserved for swap on both drives, you could
disable swap and use that space.) Then, once the disks are "busy",
do something like extract src.tar.gz to both wd0a and wd1a, and
compare the extracted bits to see if there are any differences.
(You'll need to tune things so you don't run out of space, of
course.)
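The extract-and-compare step might look something like the sketch
below (plain sh rather than csh; `compare_trees` is just a
hypothetical helper name, and it assumes src.tar.gz sits in the
current directory):

```shell
#!/bin/sh
# compare_trees DIR1 DIR2: extract the same archive onto two disks at
# once, then compare the resulting trees bit for bit.  Any diff output
# (non-zero exit) means the bits came back different on one disk.
compare_trees() {
    tar -xzf src.tar.gz -C "$1" &
    tar -xzf src.tar.gz -C "$2" &
    wait
    diff -r "$1" "$2"
}

# Usage, assuming wd0a and wd1a are mounted at these (hypothetical)
# mount points:
#   compare_trees /mnt/wd0a /mnt/wd1a
```

Run it while the foreach loop above is still hammering the disks; a
clean exit with no diff output means both trees match.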
I suspect it's a drive controller issue (or driver issue) that only
manifests itself when you push both channels really hard...