Subject: Re: fs corruption with raidframe (again)
To: Louis Guillaume <lguillaume@berklee.edu>
From: Greg Oster <oster@cs.usask.ca>
List: netbsd-users
Date: 01/10/2005 21:57:11
Louis Guillaume writes:
> Hi Everyone,
>
> Once again I have a suspicion that RaidFrame (RAID-1) is causing some
> file system corruption.
>
> I noticed this last year and posted here, but wasn't really able to do
> much troubleshooting. At that time I had a separate raid device for each
> filesystem and swap. After failing all the components on one of the
> disks, the machine ran fine for months. I sync'd the second component
> for the root partition, and it seemed fine. That also ran for months and
> months. A couple of times I attempted to bring on the "/usr" partition's
> second component, and all of a sudden I'd see file system corruption
> (similar to what's described below). As soon as that offending component
> was removed, all was well.
RAIDframe doesn't actually check the data that a component returns...
so if it's given bogus data, it happily passes that on...
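Since nothing in the read path compares the two halves of a mirror, one
way to find out whether they have drifted apart is to compare them
yourself. What follows is only a rough sketch, not anything RAIDframe
provides: it assumes both components can be read as plain files (raw
partitions or dd images, ideally while the set is quiescent), and the
function name, the block size and the 'start' offset are all made up
for illustration. Any per-component metadata kept near the start of
each component would be expected to differ, which is the reason for the
'start' parameter.

```python
def find_mismatches(path_a, path_b, block_size=65536, start=0, limit=10):
    """Return the byte offsets of blocks that differ between two files.

    Hypothetical helper: path_a and path_b stand in for the two mirror
    components (or images of them). Comparison begins at byte 'start'
    and stops once 'limit' differing blocks have been found.
    """
    mismatches = []
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        a.seek(start)
        b.seek(start)
        offset = start
        while len(mismatches) < limit:
            block_a = a.read(block_size)
            block_b = b.read(block_size)
            if not block_a and not block_b:
                break  # both inputs exhausted
            if block_a != block_b:
                mismatches.append(offset)
            offset += block_size
    return mismatches

# Example call (paths are placeholders, not taken from this thread):
#   find_mismatches("component0.img", "component1.img", block_size=512)
```

A non-empty result only says that the copies disagree at those offsets.
It doesn't say which copy is the correct one.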
> Now I have reverted to my original scenario - One raid device for all
> file systems. As soon as the second component was added to the array, I
> began to see the problems...
>
> Here is the scenario...
[snip]
> ################################################
>
> After setting up raid0 on wd1, i.e. before syncing with wd0, the system
> ran without a hitch for several days.
>
> As soon as I sync'd with wd0
How did you do the sync?
> and rebooted, apache failed to start, as
> did spamd, as "/usr/pkg/lib/perl5/5.8.5/i386-netbsd/CORE/libperl.so" was
> corrupted.
Wow. So the question is: which disk was it reading this from, and
where/how is the data getting scrambled? (Rhetorical question...)
> I "make replace"'d the perl package and that fixed the problem.
>
> Over the next few days, I started seeing these symptoms in my daily
> and insecurity output...
>
>
> ################################################
> Checking setuid files and devices:
> Setuid/device find errors:
> find: /dev/rwt8: Bad file descriptor
>
> Device deletions:
> crw-rw---- 1 root operator 10, 8 Dec 26 21:57:31 2004 /dev/rwt8
>
>
> mtree: dev/rwt8: Bad file descriptor
>
> ################################################
> Uptime: 3:15AM up 3 days, 33 mins, 1 user, load averages: 0.21, 0.41, 0.45
> find: /usr/share/man/man9/psignal.9: Bad file descriptor
>
> ################################################
>
>
>
> Perhaps someone can replicate this. Please let me know if there is
> anything more I can do to test what might be the problem here. The
> corruption seems minor - all my stuff still works (for now). But it does
> worry me.
Any corruption is bad/wrong.
> Any idea what could be causing this? Please let me know if I can provide
> more information. Thanks,
Hmm... Have you used wd0 for anything else? Did it behave normally?
Memory checks out fine? Do you see the corruption with a non-SMP
kernel?
If you're feeling like an adventure: if you do a
'raidctl -f /dev/wd1a raid0', do you still see the corruption?
Later...
Greg Oster