Subject: Re: raidframe problems (revisited)
To: Greg Troxel <gdt@ir.bbn.com>
From: Louis Guillaume <lguillaume@berklee.edu>
List: netbsd-users
Date: 05/28/2007 02:03:46
Greg Troxel wrote:

> I use one of the SATA system to store pictures, and thus often do ~"cp
> /cf/*.jpg ~/PICTURES/foo" with a GB or so of pictures.  On this
> system, I found corrupt images, and the two halves of the raid set
> were sometimes different.  I ran memtest+ for days without a failure
> (memtesters can only prove trouble; they can't prove it's ok).  I then
> pulled one of the 2 1GB sticks and have not had a single problem
> since.  So, I suspect that either the memory is bad, or the power
> supply is weak, or something like that.
> 

By any chance, Greg, are your other systems that use raidframe subjected
to the same kind of massive i/o as this one?

I have noticed that my raidframe issues only seem to crop up when large
amounts of data are copied onto those disks.

The problem you describe here is exactly the problem I have been seeing,
with the exception that pulling, replacing or swapping my memory makes no
difference.
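
One way to catch this kind of corruption right after a large copy is to
checksum each source file against its copy. Below is a minimal sketch in
Python, assuming the source and destination are plain directories. The
function names and the "*.jpg" pattern are only illustrative and are not
part of any raidframe tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large files need little memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_mismatches(src_dir, dst_dir, pattern="*.jpg"):
    """Return names of source files whose copy is missing or differs."""
    bad = []
    for src in sorted(Path(src_dir).glob(pattern)):
        dst = Path(dst_dir) / src.name
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            bad.append(src.name)
    return bad
```

Note that a check run immediately after the copy may be answered from
the buffer cache rather than from the disks, so it is probably worth
repeating after a remount or reboot.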

> The other hypothesis is that raidframe is buggy

I'm starting to believe that there is something funky going on inside of
raidframe itself. I've seen this behaviour on different hardware:
different disks, different memory, different power supplies. Perhaps there
is a way to identify what types of hardware have these problems.

I've been using my SATA drives for a month or so now with one side of
the RAID1 failed like this...

Components:
           /dev/wd0a: optimal
           /dev/wd1a: failed
No spares.

... and it is flawless.

I am extremely confident that if I reconstruct this array, I'll see the
corruption again.

Another interesting manifestation of the corruption was when my wife
started using Electric Sheep as her screen saver. Her home directory is
on an NFS-exported filesystem on this same array (raid1).

Before the last reconstruct I backed everything up. After
reconstruction, each new sheep her machine downloaded showed strange
artifacts and some had a kind of "scrambled" look. But everything
worked, strangely enough.

After failing wd1a and restoring from backup, all of her sheep work
normally.

I've had this same problem on Pentium Pro, Pentium III and Athlon
systems. I have swapped drives, cables, memory and power supplies. The
constant is the way the system is used: a file server, sharing home
directories and other stuff over NFS and netatalk to NetBSD, Linux and
Mac systems.

The next thing I will do is attempt to reproduce the problem on a
completely different machine that hasn't been involved in any of this
and see what happens.
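
A reproduction attempt along these lines could write a large file of
reproducible pseudo-random data onto the array and verify it later, for
example after a reconstruct, without needing to keep a reference copy.
This is only a sketch: the helper names, the seed and the chunk size are
assumptions made for illustration.

```python
import os
import random

CHUNK = 1 << 20  # 1 MiB per chunk; an arbitrary choice


def pattern_chunks(seed, n_chunks, chunk=CHUNK):
    """Yield the same pseudo-random chunks every time for a given seed."""
    rng = random.Random(seed)
    for _ in range(n_chunks):
        yield rng.getrandbits(8 * chunk).to_bytes(chunk, "little")


def write_pattern(path, seed, n_chunks, chunk=CHUNK):
    """Write the pattern to path and ask the OS to flush it to disk."""
    with open(path, "wb") as f:
        for data in pattern_chunks(seed, n_chunks, chunk):
            f.write(data)
        f.flush()
        os.fsync(f.fileno())


def first_bad_chunk(path, seed, n_chunks, chunk=CHUNK):
    """Return the index of the first chunk that differs, or None if clean."""
    with open(path, "rb") as f:
        for i, expected in enumerate(pattern_chunks(seed, n_chunks, chunk)):
            if f.read(chunk) != expected:
                return i
    return None
```

For example, write_pattern("/some/path/on/raid/test.bin", 1234, 2048)
would write about 2 GB, and first_bad_chunk() with the same arguments
would report where the data first went wrong, if anywhere. The path is
a placeholder. Using more data than the machine has RAM should make it
less likely that the read-back is satisfied from cache.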

Any other ideas about where I can go from here would be great. Also, if
anyone is interested in trying to reproduce the problem, it would
certainly help rule ME out as the cause :)

Louis