netbsd-users: Re: fs corruption with raidframe (again)

Subject: Re: fs corruption with raidframe (again)
To: Louis Guillaume <lguillaume@berklee.edu>
From: Greg Oster <oster@cs.usask.ca>
List: netbsd-users
Date: 01/10/2005 21:57:11
Louis Guillaume writes:
> Hi Everyone,
> 
> Once again I have a suspicion that RaidFrame (RAID-1) is causing some 
> file system corruption.
>
> I noticed this last year and posted here, but wasn't really able to do 
> much troubleshooting. At that time I had a separate raid device for each 
> filesystem and swap. After failing all the components on one of the 
> disks, the machine ran fine for months. I sync'd the second component 
> for the root partition, and it seemed fine. That also ran for months and 
> months. A couple of times I attempted to bring on the "/usr" partition's 
> second component, and all of a sudden I'd see file system corruption 
> (similar to what's described below). As soon as that offending component 
> was removed, all was well.

RAIDframe doesn't actually check the data that a component returns... 
so if it's given bogus data, it happily passes that on... 
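
Since nothing in the read path compares the two halves of a mirror, one way 
to look for disagreement is to compare them block by block from userland. 
The following is only a rough sketch of that idea, not something RAIDframe 
provides: the function name, the block size and the device paths in the 
usage note are assumptions for illustration.

```python
import io


def find_mismatches(a, b, block_size=512, limit=10):
    """Return byte offsets of blocks that differ between two binary streams.

    a, b       -- file-like objects opened in binary mode
    block_size -- comparison granularity in bytes (512 is an assumed sector size)
    limit      -- stop after this many mismatching blocks
    """
    mismatches = []
    offset = 0
    while len(mismatches) < limit:
        block_a = a.read(block_size)
        block_b = b.read(block_size)
        if not block_a and not block_b:
            break  # both streams exhausted
        if block_a != block_b:
            mismatches.append(offset)
        offset += block_size
    return mismatches


# Small in-memory demonstration: the second block differs.
left = io.BytesIO(b"A" * 512 + b"B" * 512 + b"C" * 512)
right = io.BytesIO(b"A" * 512 + b"X" * 512 + b"C" * 512)
print(find_mismatches(left, right))
```

On real hardware the two streams would be the raw component partitions 
(hypothetically something like /dev/rwd0a and /dev/rwd1a, opened "rb" as 
root). Any areas the RAID driver keeps per component, and any writes made 
while the array is in use, would be expected to show up as differences, so 
the output is only a hint about where to look.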

> Now I have reverted to my original scenario - One raid device for all 
> file systems. As soon as the second component was added to the array, I 
> began to see the problems...
> 
> Here is the scenario...
[snip]
> ################################################
> 
> After setting up raid0 on wd1, i.e. before syncing with wd0, the system 
> ran without a hitch for several days.
> 
> As soon as I sync'd with wd0 

How did you do the sync?

> and rebooted, apache failed to start, as 
> did spamd, as "/usr/pkg/lib/perl5/5.8.5/i386-netbsd/CORE/libperl.so" was 
> corrupted.

Wow.  So the question is: which disk was it reading this from, and 
where/how is the data getting scrambled?  (rhetorical question..)
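
To narrow down when files go bad, one option is to record checksums of 
known-good files and re-check them later. This is a minimal sketch under 
assumptions: the helper names are made up for illustration, SHA-256 is an 
arbitrary choice, and data served from the buffer cache may hide what is 
actually on disk.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at path."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(baseline):
    """Return paths whose current digest differs from the recorded one.

    baseline -- dict mapping path -> digest recorded while the file was good.
    Files that can no longer be read (for example, "Bad file descriptor")
    are reported as changed too.
    """
    bad = []
    for path, digest in sorted(baseline.items()):
        try:
            if file_digest(path) != digest:
                bad.append(path)
        except OSError:
            bad.append(path)
    return bad
```

A baseline would be built right after reinstalling a package, for example 
`{p: file_digest(p) for p in paths}`, and `changed_files(baseline)` run 
again after the second component has been in the array for a while.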

> I "make replace"'d the perl package and that fixed the problem.
> 
> Over the next few days, I started seeing these symptoms in my daily 
> and insecurity output...
> 
> 
> ################################################
> Checking setuid files and devices:
> Setuid/device find errors:
> find: /dev/rwt8: Bad file descriptor
> 
> Device deletions:
> crw-rw---- 1 root operator 10, 8 Dec 26 21:57:31 2004 /dev/rwt8
> 
> 
> mtree: dev/rwt8: Bad file descriptor
> 
> ################################################
> Uptime:  3:15AM up 3 days, 33 mins, 1 user, load averages: 0.21, 0.41, 0.45
> find: /usr/share/man/man9/psignal.9: Bad file descriptor
> 
> ################################################
> 
> 
> 
> Perhaps someone can replicate this. Please let me know if there is 
> anything more I can do to test what might be the problem here. The 
> corruption seems minor - all my stuff still works (for now). But it does 
> worry me.

Any corruption is bad/wrong.

> Any idea what could be causing this? Please let me know if I can provide 
> more information. Thanks,

Hmm...  Have you used wd0 for anything else?  Did it behave normally?
Memory checks out fine?  Do you see the corruption with a non-SMP 
kernel?  

If you're feeling like an adventure:  if you do a
'raidctl -f /dev/wd1a raid0', do you still see the corruption? 


Later...

Greg Oster