Subject: fs corruption with raidframe (again)
To: NetBSD Users <netbsd-users@NetBSD.org>
From: Louis Guillaume <lguillaume@berklee.edu>
List: netbsd-users
Date: 01/10/2005 18:21:11
Hi Everyone,
Once again I have a suspicion that RaidFrame (RAID-1) is causing some
file system corruption.
I noticed this last year and posted here, but wasn't really able to do
much troubleshooting. At that time I had a separate raid device for each
filesystem and swap. After failing all the components on one of the
disks, the machine ran fine for months. I sync'd the second component
for the root partition, and it seemed fine. That also ran for months and
months. A couple of times I attempted to bring the "/usr" partition's
second component online, and all of a sudden I'd see file system corruption
(similar to what's described below). As soon as that offending component
was removed, all was well.
Now I have reverted to my original scenario - One raid device for all
file systems. As soon as the second component was added to the array, I
began to see the problems...
Here is the scenario...
NetBSD 2.0 (GENERIC.MP) #0: Wed Dec 1 11:06:48 UTC 2004
builds@build:/big/builds/ab/netbsd-2-0-RELEASE/i386/200411300000Z-obj/big/builds/ab/netbsd-2-0-RELEASE/src/sys/arch/i386/compile/GENERIC.MP
total memory = 255 MB
avail memory = 242 MB
...
cpu0: Intel Pentium III (686-class), 996.87 MHz, id 0x68a
cpu1: Intel Pentium III (686-class), 996.84 MHz, id 0x68a
wd0 at atabus0 drive 0: <Maxtor 52049H4>
wd0: drive supports 16-sector PIO transfers, LBA addressing
wd0: 19541 MB, 39703 cyl, 16 head, 63 sec, 512 bytes/sect x 40020624 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd0(rccide0:0:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 2 (Ultra/33) (using DMA data transfers)
wd1 at atabus1 drive 0: <Maxtor 52049H3>
wd1: drive supports 16-sector PIO transfers, LBA addressing
wd1: 19541 MB, 39704 cyl, 16 head, 63 sec, 512 bytes/sect x 40021632 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd1(rccide0:1:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 2 (Ultra/33) (using DMA data transfers)
raid0: RAID Level 1
raid0: Components: /dev/wd0a /dev/wd1a
raid0: Total Sectors: 39102208 (19092 MB)
boot device: raid0
root on raid0a dumps on raid0b
root file system type: ffs
################################################
# raidctl -s raid0
Components:
/dev/wd0a: optimal
/dev/wd1a: optimal
No spares.
Component label for /dev/wd0a:
Row: 0, Column: 0, Num Rows: 1, Num Columns: 2
Version: 2, Serial Number: 2004123000, Mod Counter: 1303
Clean: No, Status: 0
sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
Queue size: 100, blocksize: 512, numBlocks: 39102208
RAID Level: 1
Autoconfig: Yes
Root partition: Yes
Last configured as: raid0
Component label for /dev/wd1a:
Row: 0, Column: 1, Num Rows: 1, Num Columns: 2
Version: 2, Serial Number: 2004123000, Mod Counter: 1303
Clean: No, Status: 0
sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
Queue size: 100, blocksize: 512, numBlocks: 39102208
RAID Level: 1
Autoconfig: Yes
Root partition: Yes
Last configured as: raid0
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
################################################
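The two component labels above can also be compared mechanically rather than by eye. What follows is only a minimal sketch in Python, assuming the text has the same layout as the "raidctl -s" output pasted above; the function names parse_component_labels and differing_fields are made up for illustration and are not part of any RAIDframe tool. The sample text is an abbreviated copy of the labels shown above.

```python
# Abbreviated copy of the "raidctl -s raid0" output from this post.
SAMPLE = """\
Components:
/dev/wd0a: optimal
/dev/wd1a: optimal
No spares.
Component label for /dev/wd0a:
Row: 0, Column: 0, Num Rows: 1, Num Columns: 2
Version: 2, Serial Number: 2004123000, Mod Counter: 1303
Clean: No, Status: 0
Component label for /dev/wd1a:
Row: 0, Column: 1, Num Rows: 1, Num Columns: 2
Version: 2, Serial Number: 2004123000, Mod Counter: 1303
Clean: No, Status: 0
Parity status: clean
"""

LABEL_PREFIX = "Component label for "


def parse_component_labels(text):
    """Return {component: {field: value}} for each component label block."""
    labels = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(LABEL_PREFIX):
            name = line[len(LABEL_PREFIX):].rstrip(":")
            current = labels.setdefault(name, {})
            continue
        if line.startswith("Parity status"):
            current = None  # the label section ends here
            continue
        if current is None:
            continue  # lines before the first label are ignored
        # Each label line holds comma-separated "key: value" pairs.
        for part in line.split(","):
            key, sep, value = part.partition(":")
            if sep:
                current[key.strip()] = value.strip()
    return labels


def differing_fields(labels, fields):
    """Return the fields whose values are not identical on every component."""
    result = []
    for field in fields:
        values = {label.get(field) for label in labels.values()}
        if len(values) > 1:
            result.append(field)
    return result


if __name__ == "__main__":
    parsed = parse_component_labels(SAMPLE)
    print(differing_fields(parsed, ["Serial Number", "Mod Counter", "Clean"]))
```

In the output above, the serial number and mod counter agree on both components, so a check like this would report no differences for those fields; only "Column" is expected to differ.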
After setting up raid0 on wd1, i.e. before syncing with wd0, the system
ran without a hitch for several days.
As soon as I sync'd with wd0 and rebooted, apache failed to start, as
did spamd, as "/usr/pkg/lib/perl5/5.8.5/i386-netbsd/CORE/libperl.so" was
corrupted.
I "make replace"'d the perl package and that fixed the problem.
Over the next few days, I started seeing these symptoms in my daily
and insecurity output...
################################################
Checking setuid files and devices:
Setuid/device find errors:
find: /dev/rwt8: Bad file descriptor
Device deletions:
crw-rw---- 1 root operator 10, 8 Dec 26 21:57:31 2004 /dev/rwt8
mtree: dev/rwt8: Bad file descriptor
################################################
Uptime: 3:15AM up 3 days, 33 mins, 1 user, load averages: 0.21, 0.41, 0.45
find: /usr/share/man/man9/psignal.9: Bad file descriptor
################################################
Perhaps someone can replicate this. Please let me know if there is
anything more I can do to test what might be the problem here. The
corruption seems minor - all my stuff still works (for now). But it does
worry me.
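One possible way to test for this kind of silent corruption is to record checksums of a tree that should not change (for example an installed package directory) and compare them again later. The following is a minimal sketch only, assuming Python is available on the machine; build_manifest and changed_files are hypothetical helper names, and the choice of SHA-256 is arbitrary.

```python
import hashlib
import os


def build_manifest(root):
    """Map each regular file under root to its SHA-256 digest (or an error)."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                digest = hashlib.sha256()
                with open(path, "rb") as fh:
                    # Read in 1 MB chunks so large files do not fill memory.
                    for chunk in iter(lambda: fh.read(1 << 20), b""):
                        digest.update(chunk)
                manifest[path] = digest.hexdigest()
            except OSError as exc:
                # Record read failures such as "Bad file descriptor".
                manifest[path] = "ERROR: %s" % exc
    return manifest


def changed_files(old, new):
    """Return paths present in both manifests whose recorded value differs."""
    return sorted(p for p in old if p in new and old[p] != new[p])
```

Running build_manifest on the same directory before and after adding the second component, then passing both results to changed_files, would list files whose contents changed even though nothing was supposed to write to them. Files that legitimately change (logs, spool files) would show up too, so this only makes sense on static trees.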
Any idea what could be causing this? Please let me know if I can provide
more information. Thanks,
Louis