Subject: Re: fs corruption with raidframe (again)
To: Louis Guillaume <lguillaume@berklee.edu>
From: Greg Oster <oster@cs.usask.ca>
List: netbsd-users
Date: 01/11/2005 07:39:25

Louis Guillaume writes:
> Greg Oster wrote:
> > Louis Guillaume writes:
[snip]
> >
> >>and rebooted, apache failed to start, as
> >>did spamd, as "/usr/pkg/lib/perl5/5.8.5/i386-netbsd/CORE/libperl.so" was
> >>corrupted.
> >
> >
> > wow. So the question is: what disk was it reading this from, and
> > where/how is the data getting scrambled? (rhetorical question..)
> >
> >
>
> A bad disk or bus? That was my initial suspicion the last time this
> happened. But I think I've disproved this theory. See below...

I'm not convinced yet :-}

[snip]
> >
> >>Any idea what could be causing this? Please let me know if I can provide
> >>more information. Thanks,
> >
> >
> > Hmm... Have you used wd0 for anything else? Did it behave normally?
> > Memory checks out fine? Do you see the corruption with a non-SMP
> > kernel?
> >
> > If you're feeling like an adventure: if you do a
> > 'raidctl -f /dev/wd1a raid0', do you still see the corruption?
> >
>
> That disk has had the system on it. Actually there are 3 disks total
> that I've been swapping around for a while. Currently the cold spare
> disk is the original disk that I thought may have been the problem.
>
> I'm going to let things go for now as-is. I suspect that the corruption
> happens when a lot of data is written/read to/from the partitions in
> question. The first time I noticed this was after a binary upgrade
> (pax). Here it happens again after syncing the raid components.

After the 'raidctl -F component0', right? Hmmmm.... When you did
the reboot, did you do it immediately after the above raidctl?
Or did you do some other IO before rebooting? And did you use
'shutdown -r' or something else? Was the RAID set 'dirty' after the
reboot? Did fsck have to run?

> I'll keep an eye on things. If I see more corruption tonight, I'll
> attempt to "raidctl -f /dev/wd1a raid0" and see what happens.
> Theoretically, if all hell breaks loose, I can revert to the other disk,
> right?

Yes. Another thing that could be done is to boot single-user, and
see what 'fsck -f /dev/rraid0a' says when run a few times.

> I'd really like to try and replicate this problem, so I'm going to try
> and find a machine at work to test with if I have time.

I'm quite unable to replicate it :(
Oh... and you *have* run memtest86, right?
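If it helps with narrowing down *when* the data gets scrambled: fsck only
checks filesystem structure, not file contents, so a damaged libperl.so
would not necessarily show up there. A rough sketch of a checksum
comparison follows. It is not part of RAIDframe or of any NetBSD tool; the
function names (file_digest, build_manifest, changed_files) are made up
for illustration.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each regular file under root (by relative path) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; only hash real file data.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = file_digest(full)
    return manifest


def changed_files(before, after):
    """Return sorted paths whose digest differs, or that have vanished."""
    return sorted(p for p, d in before.items() if after.get(p) != d)
```

The idea would be to build a manifest of something like /usr/pkg/lib
before the resync or reboot, keep it somewhere off the RAID set, build a
second one afterwards, and look at changed_files(before, after). Files
that were legitimately replaced in between (an upgrade, say) will show up
too, so this only means something if nothing was supposed to change.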
Later...

Greg Oster