Subject: Re: fs corruption with raidframe (again)
To: Greg Oster <oster@cs.usask.ca>
From: Louis Guillaume <lguillaume@berklee.edu>
List: netbsd-users
Date: 01/11/2005 00:29:31
Greg Oster wrote:
> Louis Guillaume writes:
>
>>Hi Everyone,
>>
>>Once again I have a suspicion that RAIDframe (RAID-1) is causing some
>>file system corruption.
>>
>>I noticed this last year and posted here, but wasn't really able to do
>>much troubleshooting. At that time I had a separate raid device for each
>>filesystem and swap. After failing all the components on one of the
>>disks, the machine ran fine for months. I sync'd the second component
>>for the root partition, and it seemed fine. That also ran for months and
>>months. A couple of times I attempted to bring on the "/usr" partition's
>>second component, and all of a sudden I'd see file system corruption
>>(similar to what's described below). As soon as that offending component
>>was removed, all was well.
>
>
> RAIDframe doesn't actually check the data that a component returns...
> so if it's given bogus data, it happily passes that on...
>
Right. But the bogus data only seems to crop up when RAID is involved.
>
>>Now I have reverted to my original scenario - one raid device for all
>>file systems. As soon as the second component was added to the array, I
>>began to see the problems...
>>
>>Here is the scenario...
>
> [snip]
>
>>################################################
>>
>>After setting up raid0 on wd1, i.e. before syncing with wd0, the system
>>ran without a hitch for several days.
>>
>>As soon as I sync'd with wd0
>
>
> How did you do the sync?
pax -Xrwvpe ... ...
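That was a whole-tree copy onto the new array; roughly something like
the following, where the paths are placeholders rather than the exact
ones I used:

  # copy a mounted filesystem onto its raid-backed replacement,
  # staying on one filesystem (-X) and preserving all attributes (-pe)
  cd /usr && pax -X -rw -v -pe . /mnt/usr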
>
>>and rebooted, apache failed to start, as
>>did spamd, because "/usr/pkg/lib/perl5/5.8.5/i386-netbsd/CORE/libperl.so" was
>>corrupted.
>
>
> wow. So the question is: what disk was it reading this from, and
> where/how is the data getting scrambled? (rhetorical question..)
>
>
A bad disk or bus? That was my initial suspicion the last time this
happened, but I think I've since disproved that theory. See below...
>>I "make replace"'d the perl package and that fixed the problem.
>>
>>Over the next few days, I started seeing the following symptoms in my
>>daily and insecurity output...
>>
>>
>>################################################
>>Checking setuid files and devices:
>>Setuid/device find errors:
>>find: /dev/rwt8: Bad file descriptor
>>
>>Device deletions:
>>crw-rw---- 1 root operator 10, 8 Dec 26 21:57:31 2004 /dev/rwt8
>>
>>
>>mtree: dev/rwt8: Bad file descriptor
>>
>>################################################
>>Uptime: 3:15AM up 3 days, 33 mins, 1 user, load averages: 0.21, 0.41, 0.45
>>find: /usr/share/man/man9/psignal.9: Bad file descriptor
>>
>>################################################
>>
>>
>>
>>Perhaps someone can replicate this. Please let me know if there is
>>anything more I can do to test what might be the problem here. The
>>corruption seems minor - all my stuff still works (for now). But it does
>>worry me.
>
>
> Any corruption is bad/wrong.
>
Indeed.
>
>>Any idea what could be causing this? Please let me know if I can provide
>>more information. Thanks,
>
>
> Hmm... Have you used wd0 for anything else? Did it behave normally?
> Memory checks out fine? Do you see the corruption with a non-SMP
> kernel?
>
> If you're feeling like an adventure: if you do a
> 'raidctl -f /dev/wd1a raid0', do you still see the corruption?
>
That disk has had the system on it before. Actually, there are three
disks in total that I've been swapping around for a while. Currently the
cold spare is the original disk that I thought may have been the problem.
I'm going to let things go as-is for now. I suspect that the corruption
happens when a lot of data is written to or read from the partitions in
question. The first time I noticed this was after a binary upgrade (via
pax). Here it happened again after syncing the raid components.
I'll keep an eye on things. If I see more corruption tonight, I'll
attempt to "raidctl -f /dev/wd1a raid0" and see what happens.
Theoretically, if all hell breaks loose, I can revert to the other disk,
right?
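For anyone following along, the sequence I have in mind is roughly the
following (device names as in my setup; the rebuild step is just my
reading of the raidctl man page, so treat it as a sketch):

  # mark the suspect component as failed; raid0 keeps running
  # degraded on the remaining component
  raidctl -f /dev/wd1a raid0

  # check component and reconstruction status
  raidctl -s raid0

  # later, fail and rebuild back onto the same component in place
  raidctl -R /dev/wd1a raid0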
I'd really like to try to replicate this problem, so I'm going to find a
machine at work to test with if I have time.
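If I do, my rough test plan (paths hypothetical) would be to write a
large tree to the mirrored filesystem, record checksums, and re-verify
after a reboot or after pushing a lot of data through the array:

  # record checksums of a freshly written tree
  mtree -c -K sha1 -p /raid/test > /tmp/test.spec

  # ...reboot, or generate heavy I/O on the array...

  # re-verify; any mismatch means silently corrupted data
  mtree -p /raid/test < /tmp/test.spec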
Louis