Subject: Re: fs corruption with raidframe (again)
To: Greg Oster <oster@cs.usask.ca>
From: Louis Guillaume <lguillaume@berklee.edu>
List: netbsd-users
Date: 01/12/2005 01:36:40
Greg Oster wrote:
> Louis Guillaume writes:
> 
>>Greg Oster wrote:
>>
>>>Louis Guillaume writes:
> 
>>>
>>>Hmm...  Have you used wd0 for anything else?  Did it behave normally?
>>>Memory checks out fine?  Do you see the corruption with a non-SMP 
>>>kernel?  
>>>

wd0 has been used as the "live component" of a broken RAID-1 in the past 
when I've experienced this. I have 3 disks, each of which has run the 
system well on its own at one point or another.

>>>If you're feeling like an adventure:  if you do a
>>>'raidctl -f /dev/wd1a raid0', do you still see the corruption? 
>>>

Since my last message the system appeared to be running nicely. But I 
decided to really check and see if things were ok...

1. Run memtest86:
    I ran it through only one pass of the default tests
    (didn't have time to run them all).
    Result - 100% passed.

2. Boot into single user, fsck everything. Only the /usr filesystem
    appeared to have any serious issues. There were some "invalid
    file type" (I think) errors and others. Anyway, it fixed them
    and marked the fs clean.

3. Continue to multiuser...
    Apache fails! Says "Redirect" is an invalid directive! Eek!
    Spamd fails, dumps core. Ouch!

4. Reboot to single user...
    Here's where I ran your suggested adventure command...
    raidctl -f /dev/wd1a raid0
    fsck -fy /usr (corrects a few errors again)

5. Continue to multiuser. All is well! Apache and spamd start
    beautifully.
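
Once things look stable, my understanding is that bringing wd1a back 
into the set is an in-place rebuild, something like this (please 
correct me if I have the flags wrong):

    raidctl -R /dev/wd1a raid0   # reconstruct in place onto the failed component
    raidctl -s raid0             # watch component status / rebuild progress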


So at this point, I'm fairly satisfied that something's going on with 
RaidFrame.


> 
> After the 'raidctl -F component0', right? 
Right.

> Hmmmm....  When you did 
> the reboot, did you do it immediately after the above raidctl?  

I don't think so. That would normally be my procedure, but I think I 
remember letting the system come up multiuser, and everything started 
up and ran for a while.

> Or did you do some other IO before rebooting?

Probably.

> And did you use 
> 'shutdown -r' or something else?

shutdown -r now

> Was the RAID set 'dirty' after the 
> reboot?

No. Everything was clean.

> Did fsck have to run?  

No. The filesystems were marked clean too.
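
For reference, I check the parity and component state with something 
like this (assuming raid0 is the set in question):

    raidctl -s raid0   # component and parity status
    raidctl -p raid0   # check whether the parity is clean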

>>I'll keep an eye on things. If I see more corruption tonight, I'll 
>>attempt to "raidctl -f /dev/wd1a raid0" and see what happens. 
>>Theoretically, if all hell breaks loose, I can revert to the other disk, 
>>right?
> 
> 
> Yes.  Another thing that could be done is to boot single-user, and 
> see what 'fsck -f /dev/rraid0a' says when run a few times.  
> 

Is there anything special about specifying "/dev/rraid0a" vs. "/"? I 
used the filesystem names.

When running fsck a second time for /usr, it found more errors and fixed 
them. A third fsck found nothing.
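
For what it's worth, I assume the two are equivalent, since fsck looks 
mount points up in /etc/fstab. For the root filesystem, for example:

    fsck -f /              # fsck resolves "/" via /etc/fstab
    fsck -f /dev/rraid0a   # same check, naming the raw device directly

but I can rerun it against the raw raid devices if that makes a 
difference.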

> 
> I'm quite unable to replicate it :(

Do you have a similar machine running RAID-1?
What more can we look at to see what might be different about my machine 
or its configuration?

I plan to let the machine run as it is for at least a few days. If no 
corruption shows up I'll be more convinced :) But I'd really like to 
solve this problem, be it RaidFrame or something else.

Thanks for your help,

Louis