netbsd-users: fs corruption with raidframe (again)

Subject: fs corruption with raidframe (again)
To: NetBSD Users <netbsd-users@NetBSD.org>
From: Louis Guillaume <lguillaume@berklee.edu>
List: netbsd-users
Date: 01/10/2005 18:21:11

Hi Everyone,

Once again I suspect that RAIDframe (RAID-1) is causing some 
file system corruption.

I noticed this last year and posted here, but wasn't really able to do 
much troubleshooting. At that time I had a separate RAID device for each 
filesystem and swap. After failing all the components on one of the 
disks, the machine ran fine for months. I sync'd the second component 
for the root partition, and it seemed fine. That also ran for months and 
months. A couple of times I attempted to bring on the "/usr" partition's 
second component, and all of a sudden I'd see file system corruption 
(similar to what's described below). As soon as that offending component 
was removed, all was well.

Now I have reverted to my original setup: one RAID device for all 
file systems. As soon as the second component was added to the array, I 
began to see the problems...

Here is the scenario...

NetBSD 2.0 (GENERIC.MP) #0: Wed Dec  1 11:06:48 UTC 2004
    builds@build:/big/builds/ab/netbsd-2-0-RELEASE/i386/200411300000Z-obj/big/builds/ab/netbsd-2-0-RELEASE/src/sys/arch/i386/compile/GENERIC.MP
total memory = 255 MB
avail memory = 242 MB

...
cpu0: Intel Pentium III (686-class), 996.87 MHz, id 0x68a
cpu1: Intel Pentium III (686-class), 996.84 MHz, id 0x68a

wd0 at atabus0 drive 0: <Maxtor 52049H4>
wd0: drive supports 16-sector PIO transfers, LBA addressing
wd0: 19541 MB, 39703 cyl, 16 head, 63 sec, 512 bytes/sect x 40020624 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd0(rccide0:0:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 2 (Ultra/33) (using DMA data transfers)
wd1 at atabus1 drive 0: <Maxtor 52049H3>
wd1: drive supports 16-sector PIO transfers, LBA addressing
wd1: 19541 MB, 39704 cyl, 16 head, 63 sec, 512 bytes/sect x 40021632 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd1(rccide0:1:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 2 (Ultra/33) (using DMA data transfers)
raid0: RAID Level 1
raid0: Components: /dev/wd0a /dev/wd1a
raid0: Total Sectors: 39102208 (19092 MB)
boot device: raid0
root on raid0a dumps on raid0b
root file system type: ffs

################################################

# raidctl -s raid0
Components:
            /dev/wd0a: optimal
            /dev/wd1a: optimal
No spares.
Component label for /dev/wd0a:
    Row: 0, Column: 0, Num Rows: 1, Num Columns: 2
    Version: 2, Serial Number: 2004123000, Mod Counter: 1303
    Clean: No, Status: 0
    sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
    Queue size: 100, blocksize: 512, numBlocks: 39102208
    RAID Level: 1
    Autoconfig: Yes
    Root partition: Yes
    Last configured as: raid0
Component label for /dev/wd1a:
    Row: 0, Column: 1, Num Rows: 1, Num Columns: 2
    Version: 2, Serial Number: 2004123000, Mod Counter: 1303
    Clean: No, Status: 0
    sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
    Queue size: 100, blocksize: 512, numBlocks: 39102208
    RAID Level: 1
    Autoconfig: Yes
    Root partition: Yes
    Last configured as: raid0
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.

################################################

After setting up raid0 on wd1, i.e. before syncing with wd0, the system 
ran without a hitch for several days.

As soon as I sync'd with wd0 and rebooted, apache failed to start, as 
did spamd, because "/usr/pkg/lib/perl5/5.8.5/i386-netbsd/CORE/libperl.so" 
was corrupted.

I "make replace"'d the perl package and that fixed the problem.

Over the next few days, I started seeing these symptoms in my daily 
and insecurity output...


################################################
Checking setuid files and devices:
Setuid/device find errors:
find: /dev/rwt8: Bad file descriptor

Device deletions:
crw-rw---- 1 root operator 10, 8 Dec 26 21:57:31 2004 /dev/rwt8


mtree: dev/rwt8: Bad file descriptor

################################################
Uptime:  3:15AM up 3 days, 33 mins, 1 user, load averages: 0.21, 0.41, 0.45
find: /usr/share/man/man9/psignal.9: Bad file descriptor

################################################



Perhaps someone can replicate this. Please let me know if there is 
anything more I can do to test what might be the problem here. The 
corruption seems minor - all my stuff still works (for now). But it does 
worry me.
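
A minimal sketch of one way to test for this, assuming Python 3 is 
available on the machine (it is not part of the NetBSD base system): hash 
every regular file before the second component is added, hash again 
afterwards, and compare. The names scan_tree and changed_files are 
hypothetical helpers written for illustration, not part of any NetBSD 
tool, and this only detects changed or unreadable files. It does not say 
which component holds the bad data.

```python
import hashlib
import os
import stat


def scan_tree(root):
    """Walk root and return (digests, errors).

    digests maps each readable regular file to its SHA-256 hex digest.
    errors maps each path that could not be examined or read to the
    operating system's error message.
    """
    digests, errors = {}, {}

    def on_walk_error(exc):
        # os.walk reports directories it cannot list through this callback.
        errors[exc.filename] = exc.strerror

    for dirpath, _dirs, files in os.walk(root, onerror=on_walk_error):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                # lstat raises OSError if the entry cannot be examined.
                if not stat.S_ISREG(os.lstat(path).st_mode):
                    continue  # skip devices, symlinks, sockets
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                digests[path] = h.hexdigest()
            except OSError as exc:
                errors[path] = exc.strerror
    return digests, errors


def changed_files(before, after):
    """Return sorted paths present in both scans whose digests differ."""
    return sorted(p for p in before if p in after and before[p] != after[p])
```

Running scan_tree("/usr") on the single-component array, then again after 
the mirror is rebuilt, and passing both digest maps to changed_files 
would list files whose contents changed between the two scans. Files that 
were legitimately modified in between will also show up, so the tree 
should be otherwise idle.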

Any idea what could be causing this? Please let me know if I can provide 
more information. Thanks,

Louis