Subject: Re: raidframe problems (revisited)
To: Louis Guillaume <lguillaume@berklee.edu>
From: Greg Troxel <gdt@ir.bbn.com>
List: netbsd-users
Date: 03/10/2007 20:12:22
  From the minute the parity is clean on a fresh new raid1, I make
  filesystems, and move data on to them, then unmount and fsck. Problems
  are found.

One thing to keep in mind is that raidframe, even if it is working
flawlessly, is more demanding on the hardware because with RAID-1 all
writes have to be done on both disks.  I don't think there is any
dithering, so I would expect the two writes to hit the disks at
essentially the same time.  A marginal component may therefore be
more likely to be stressed and fail.

I'm running raidframe on 5 systems, all i386, and all RAID-1.  Two
have PATA disks, and one of those has been running since fall 2001.
It's so reliable I forget about it, but I just checked and all raid
sets are reporting 'optimal'.  Three are newer, each with 2 x Seagate
400G SATA drives.  Two of those have been flawless.

I use one of the SATA systems to store pictures, and thus often do
something like "cp /cf/*.jpg ~/PICTURES/foo" with a GB or so of
pictures.  On this
system, I found corrupt images, and the two halves of the raid set
were sometimes different.  I ran memtest+ for days without a failure
(memtesters can only prove trouble; they can't prove it's ok).  I then
pulled one of the 2 1GB sticks and have not had a single problem
since.  So, I suspect that either the memory is bad, or the power
supply is weak, or something like that.

The other hypothesis is that raidframe is buggy (this was with
-current last spring).  But, two other systems with the same hardware
(copying specs for purchase, same shop, same mobo, same cpu, same
disks, same memory brand, only difference is one has 4G instead of 2)
have had no problems whatsoever.

My conclusion is that memory tests need to do simultaneous IO
operations to really test the hardware.
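
As a rough sketch of what I mean (not a tool I actually ran; the
paths and sizes here are made up), something like this exercises
memory and disk at the same time:

  # Write a known pattern once, then copy it and re-compare in a
  # loop; a mismatch under load points at RAM or power, not disks.
  dd if=/dev/urandom of=/tmp/pattern bs=1m count=512
  while true; do
      cp /tmp/pattern /home/scratch/copy
      cmp /tmp/pattern /home/scratch/copy || echo MISMATCH
  done

Running two or three of these loops in parallel keeps memory and both
disks of the mirror busy at once.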

So I can believe that your system doesn't work with raid, but does
with single disks.  I'd be suspicious of the power supply and your
memory.

To test, I'd mount the underlying filesystems read-only and compare.
I have wd0a and wd1a each running from sector 63 to the end, type
RAID, and raid0 is partitioned / /var /usr /home pretty normally.  I
then have wd0 e/f/g/h matching raid0 e/f/g/a (in that order, so wd0h
mirrors raid0a).  This is tricky, but if you add the start of the
RAID partition, raidframe's component-label offset (64 sectors), and
the partition's offset within raid0, you get the starting sector of
one of the copies of the filesystem.  (This only works for RAID-1.)
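
For example, with made-up numbers: if disklabel wd0 shows wd0a
starting at sector 63 and disklabel raid0 shows raid0g starting at
sector 2097152, then the matching wd0g entry should start at

  63 + 64 + 2097152 = 2097279
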
Then, after putting files in /home, I can mount wd0g and wd1g
read-only and sha1 the same files from each underlying disk.  I found
a few that differed.  Looking at the differences, the bad copy would
have 3-20 bytes with extra bits set, usually 0200 (the high bit).
This feels more like memory problems than raidframe bugs.
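
Roughly what that looks like, reconstructed from memory (the mount
points and file name are examples, not my actual paths):

  # mount each underlying copy read-only and checksum the same files
  mount -r /dev/wd0g /mnt/d0
  mount -r /dev/wd1g /mnt/d1
  (cd /mnt/d0 && sha1 *.jpg) > /tmp/d0.sha1
  (cd /mnt/d1 && sha1 *.jpg) > /tmp/d1.sha1
  diff /tmp/d0.sha1 /tmp/d1.sha1

  # for a file that differs, cmp -l prints each offset where the two
  # copies disagree plus both octal byte values, so stuck 0200 bits
  # show up directly
  cmp -l /mnt/d0/foo.jpg /mnt/d1/foo.jpg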

If you can pull each half of your memory in turn and retest, that
would be interesting.


Your problem sounds worse.