netbsd-users: Re: fs corruption with raidframe (again)

Subject: Re: fs corruption with raidframe (again)
To: Louis Guillaume <lguillaume@berklee.edu>
From: Greg Oster <oster@cs.usask.ca>
List: netbsd-users
Date: 01/11/2005 07:39:25
Louis Guillaume writes:
> Greg Oster wrote:
> > Louis Guillaume writes:
[snip]
> > 
> >>and rebooted, apache failed to start, as 
> >>did spamd, as "/usr/pkg/lib/perl5/5.8.5/i386-netbsd/CORE/libperl.so" was 
> >>corrupted.
> > 
> > 
> > wow.  So the question is: what disk was it reading this from, and 
> > where/how is the data getting scrambled?  (rhetorical question..)
> > 
> > 
> 
> A bad disk or bus? That was my initial suspicion the last time this 
> happened. But I think I've disproved this theory. See below...

I'm not convinced yet :-}

[snip]
> > 
> >>Any idea what could be causing this? Please let me know if I can provide 
> >>more information. Thanks,
> > 
> > 
> > Hmm...  Have you used wd0 for anything else?  Did it behave normally?
> > Memory checks out fine?  Do you see the corruption with a non-SMP 
> > kernel?  
> > 
> > If you're feeling like an adventure:  if you do a
> > 'raidctl -f /dev/wd1a raid0', do you still see the corruption? 
> > 
> 
> That disk has had the system on it. Actually there are 3 disks total 
> that I've been swapping around for a while. Currently the cold spare 
> disk is the original disk that I thought may have been the problem.
> 
> I'm going to let things go for now as-is. I suspect that the corruption 
> happens when a lot of data is written/read to/from the partitions in 
> question. The first time I noticed this was after a binary upgrade 
> (pax). Here it happened again after syncing the raid components.

After the 'raidctl -F component0', right?  Hmmmm....  When you did 
the reboot, did you do it immediately after the above raidctl?  
Or did you do some other IO before rebooting?  And did you use 
'shutdown -r' or something else?  Was the RAID set 'dirty' after the 
reboot?  Did fsck have to run?  

> I'll keep an eye on things. If I see more corruption tonight, I'll 
> attempt to "raidctl -f /dev/wd1a raid0" and see what happens. 
> Theoretically, if all hell breaks loose, I can revert to the other disk, 
> right?

Yes.  Another thing that could be done is to boot single-user, and 
see what 'fsck -f /dev/rraid0a' says when run a few times.  
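
Along the same lines, one way to "keep an eye on things" is to record 
checksums of a tree that shouldn't be changing (the perl5 library 
directory, say) and compare again after a reboot or a burst of heavy IO.  
What follows is only a rough sketch and not something taken from this 
thread: the function names snapshot and verify, and the /tmp template 
paths, are made up for illustration.  It assumes a POSIX sh with find, 
cksum, sort, diff and mktemp, and that the manifest file is kept outside 
the directory being scanned.

```sh
#!/bin/sh
# Hypothetical helpers; names and paths are illustrative assumptions.

# Record one "checksum size filename" line per regular file.
snapshot() {
    # $1 = directory to scan, $2 = manifest file to write
    find "$1" -type f -exec cksum {} + | sort -k 3 > "$2"
}

# Re-scan the directory and compare against an earlier manifest.
# Prints the differing lines and returns 1 if anything changed,
# returns 0 if the tree matches, 2 if no temp file could be made.
verify() {
    # $1 = directory to scan, $2 = manifest written by snapshot
    tmp=$(mktemp /tmp/verify.XXXXXX) || return 2
    snapshot "$1" "$tmp"
    if diff "$2" "$tmp"; then
        status=0
    else
        status=1
    fi
    rm -f "$tmp"
    return $status
}
```

Something like 'snapshot /usr/pkg/lib/perl5 /root/perl5.cksum' before the 
reboot and 'verify /usr/pkg/lib/perl5 /root/perl5.cksum' afterwards would 
show which files changed.  Note that re-reading a file right away will 
most likely be served from the buffer cache, so the comparison only says 
something about the disks after a reboot or enough other IO to push the 
data out of the cache.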

> I'd really like to try and replicate this problem, so I'm going to try 
> and find a machine at work to test with if I have time.

I'm quite unable to replicate it :(

Oh... and you *have* run memtest86, right?  

Later...

Greg Oster