Subject: Re: fs corruption with raidframe (again)
To: Greg Oster <oster@cs.usask.ca>
From: Louis Guillaume <lguillaume@berklee.edu>
List: netbsd-users
Date: 01/11/2005 00:29:31
Greg Oster wrote:
> Louis Guillaume writes:
>
>>Hi Everyone,
>>
>>Once again I have a suspicion that RAIDframe (RAID-1) is causing some
>>file system corruption.
>>
>>I noticed this last year and posted here, but wasn't really able to do
>>much troubleshooting. At that time I had a separate raid device for each
>>filesystem and swap. After failing all the components on one of the
>>disks, the machine ran fine for months. I sync'd the second component
>>for the root partition, and it seemed fine. That also ran for months and
>>months. A couple of times I attempted to bring on the "/usr" partition's
>>second component, and all of a sudden I'd see file system corruption
>>(similar to what's described below). As soon as that offending component
>>was removed, all was well.
>
>
> RAIDframe doesn't actually check the data that a component returns...
> so if it's given bogus data, it happily passes that on...
>
Right. But the bogus data only seems to crop up when RAID is involved.
>
>>Now I have reverted to my original scenario - one raid device for all
>>file systems. As soon as the second component was added to the array, I
>>began to see the problems...
>>
>>Here is the scenario...
>
> [snip]
>
>>################################################
>>
>>After setting up raid0 on wd1, i.e. before syncing with wd0, the system
>>ran without a hitch for several days.
>>
>>As soon as I sync'd with wd0
>
>
> How did you do the sync?
pax -Xrwvpe ... ...
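That was a whole-tree copy onto the new array; roughly something like
the following, where the paths are placeholders rather than the exact
ones I used:

  # copy a mounted filesystem onto its raid-backed replacement,
  # staying on one filesystem (-X) and preserving all attributes (-pe)
  cd /usr && pax -X -rw -v -pe . /mnt/usr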
>
>>and rebooted, apache failed to start, as
>>did spamd, because "/usr/pkg/lib/perl5/5.8.5/i386-netbsd/CORE/libperl.so" was
>>corrupted.
>
>
> wow. So the question is: what disk was it reading this from, and
> where/how is the data getting scrambled? (rhetorical question..)
>
>
A bad disk or bus? That was my initial suspicion the last time this
happened, but I think I've since disproved that theory. See below...
>>I "make replace"'d the perl package and that fixed the problem.
>>
>>Over the next few days, I started seeing the following symptoms in my
>>daily and insecurity output...
>>
>>
>>################################################
>>Checking setuid files and devices:
>>Setuid/device find errors:
>>find: /dev/rwt8: Bad file descriptor
>>
>>Device deletions:
>>crw-rw---- 1 root operator 10, 8 Dec 26 21:57:31 2004 /dev/rwt8
>>
>>
>>mtree: dev/rwt8: Bad file descriptor
>>
>>################################################
>>Uptime: 3:15AM up 3 days, 33 mins, 1 user, load averages: 0.21, 0.41, 0.45
>>find: /usr/share/man/man9/psignal.9: Bad file descriptor
>>
>>################################################
>>
>>
>>
>>Perhaps someone can replicate this. Please let me know if there is
>>anything more I can do to test what might be the problem here. The
>>corruption seems minor - all my stuff still works (for now). But it does
>>worry me.
>
>
> Any corruption is bad/wrong.
>
Indeed.
>
>>Any idea what could be causing this? Please let me know if I can provide
>>more information. Thanks,
>
>
> Hmm... Have you used wd0 for anything else? Did it behave normally?
> Memory checks out fine? Do you see the corruption with a non-SMP
> kernel?
>
> If you're feeling like an adventure: if you do a
> 'raidctl -f /dev/wd1a raid0', do you still see the corruption?
>
That disk has had the system on it before. Actually, there are three
disks in total that I've been swapping around for a while. Currently the
cold spare is the original disk that I thought may have been the problem.
I'm going to let things go as-is for now. I suspect that the corruption
happens when a lot of data is written to or read from the partitions in
question. The first time I noticed this was after a binary upgrade (via
pax). Here it happened again after syncing the raid components.
I'll keep an eye on things. If I see more corruption tonight, I'll
attempt to "raidctl -f /dev/wd1a raid0" and see what happens.
Theoretically, if all hell breaks loose, I can revert to the other disk,
right?
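For anyone following along, the sequence I have in mind is roughly the
following (device names as in my setup; the rebuild step is just my
reading of the raidctl man page, so treat it as a sketch):

  # mark the suspect component as failed; raid0 keeps running
  # degraded on the remaining component
  raidctl -f /dev/wd1a raid0

  # check component and reconstruction status
  raidctl -s raid0

  # later, fail and rebuild back onto the same component in place
  raidctl -R /dev/wd1a raid0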
I'd really like to try to replicate this problem, so I'm going to find a
machine at work to test with if I have time.
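If I do, my rough test plan (paths hypothetical) would be to write a
large tree to the mirrored filesystem, record checksums, and re-verify
after a reboot or after pushing a lot of data through the array:

  # record checksums of a freshly written tree
  mtree -c -K sha1 -p /raid/test > /tmp/test.spec

  # ...reboot, or generate heavy I/O on the array...

  # re-verify; any mismatch means silently corrupted data
  mtree -p /raid/test < /tmp/test.spec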
Louis