Subject: Re: fs corruption with raidframe (again)
To: Greg Oster <oster@cs.usask.ca>
From: Louis Guillaume <lguillaume@berklee.edu>
List: netbsd-users
Date: 01/12/2005 01:36:40
Greg Oster wrote:
> Louis Guillaume writes:
>
>>Greg Oster wrote:
>>
>>>Louis Guillaume writes:
>
>>>
>>>Hmm... Have you used wd0 for anything else? Did it behave normally?
>>>Memory checks out fine? Do you see the corruption with a non-SMP
>>>kernel?
>>>
wd0 has been used as the "live component" of a broken RAID-1 in the past
when I experienced this. I have 3 disks, each of which has run the
system well on its own at one point or another.
>>>If you're feeling like an adventure: if you do a
>>>'raidctl -f /dev/wd1a raid0', do you still see the corruption?
>>>
Since my last message the system appeared to be running nicely. But I
decided to really check and see if things were ok...
1. run memtest86:
I ran this through only one iteration of the default
tests. (didn't have time to go through all)
Result - 100% passed.
2. Boot into single user, fsck everything. Only the /usr filesystem
appeared to have any serious issues. There were some "invalid
file type" (I think) errors and others. Anyway it fixed them
and marked the fs clean.
3. Continue to multiuser...
Apache fails! Says "Redirect" is an invalid directive! Eek!
Spamd fails, dumps core. Ouch!
4. Reboot to single user...
Here's where I ran your suggested adventure command...
raidctl -f /dev/wd1a raid0
fsck -fy /usr (corrects a few errors again)
5. Continue to multiuser. All is well! Apache and spamd start
beautifully.
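
For reference, steps 4 and 5 can be sketched as a small script. This is only a sketch under assumptions: it takes the RAID set to be raid0 and the component to fail to be /dev/wd1a, as in this thread, and the `run` helper and `RUN` switch are hypothetical names added here for a dry run; they are not part of raidctl or fsck.

```shell
#!/bin/sh
# Sketch of steps 4-5 above, meant for single-user mode.
# By default it only prints the commands. Set RUN=1 (a hypothetical
# switch defined by this script) to actually execute them.

run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

recover_usr() {
    # Mark the wd1a component as failed without starting a
    # reconstruction, so raid0 runs on the remaining disk only.
    run raidctl -f /dev/wd1a raid0
    # Force a check of /usr and answer yes to every repair prompt.
    run fsck -fy /usr
}

recover_usr
```

Run without RUN=1 first to see what it would do before letting it touch the RAID set.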
So at this point, I'm fairly satisfied that something's going on with
RAIDframe.
>
> After the 'raidctl -F component0', right?
Right.
> Hmmmm.... When you did
> the reboot, did you do it immediately after the above raidctl?
I don't think so. It would normally be my procedure to do so, but I
think I remember allowing the system to come up multiuser and everything
started up and ran for a while.
> Or did you do some other IO before rebooting?
Probably.
> And did you use
> 'shutdown -r' or something else?
shutdown -r now
> Was the RAID set 'dirty' after the
> reboot?
No. Everything was clean.
> Did fsck have to run?
No. The filesystems were marked clean too.
>>I'll keep an eye on things. If I see more corruption tonight, I'll
>>attempt to "raidctl -f /dev/wd1a raid0" and see what happens.
>>Theoretically, if all hell breaks loose, I can revert to the other disk,
>>right?
>
>
> Yes. Another thing that could be done is to boot single-user, and
> see what 'fsck -f /dev/rraid0a' says when run a few times.
>
Anything special about specifying "/dev/rraid0a" vs. "/" ? I used the
filesystem names.
When running fsck a second time for /usr, it found more errors and fixed
them. A third fsck found nothing.
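
The "run it a few times" check could be scripted roughly as below. It is a sketch, not a tested procedure: `check_until_clean` is a hypothetical helper, and it assumes the check command exits non-zero whenever it found or repaired something, which should be verified against fsck(8) on the system in question.

```shell
#!/bin/sh
# Repeat a check command until it exits 0, up to a maximum number of passes.
# Usage: check_until_clean MAX_PASSES COMMAND [ARGS...]
# Assumption: the command's exit status is 0 only when nothing was found.

check_until_clean() {
    max=$1
    shift
    pass=1
    while [ "$pass" -le "$max" ]; do
        echo "pass $pass: $*"
        if "$@"; then
            echo "clean after $pass pass(es)"
            return 0
        fi
        pass=$((pass + 1))
    done
    echo "still reporting problems after $max passes"
    return 1
}
```

In single-user mode this would be called as `check_until_clean 5 fsck -f /dev/rraid0a`. A filesystem that needs more than one pass to come up clean, as /usr did here, is worth noting in itself.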
>
> I'm quite unable to replicate it :(
Do you have a similar machine running RAID-1?
What more can we look at to see what might be different about my machine
or its configuration?
Now I plan to let the machine run as it is now for at least a few days.
If no corruption shows up I'll be more convinced :) But I'd really like
to solve this problem, be it RAIDframe or something else.
Thanks for your help,
Louis