Subject: Re: fs corruption with raidframe (again)
To: Greg Oster <oster@cs.usask.ca>
From: Louis Guillaume <lguillaume@berklee.edu>
List: netbsd-users
Date: 01/13/2005 00:23:23
Greg Oster wrote:
> Louis Guillaume writes:
>
>>Greg Oster wrote:
>>
>>>Louis Guillaume writes:
>>>
>>>
>>>>Greg Oster wrote:
>>>>
>>>>
>>>>>Louis Guillaume writes:
>
> [snip]
>
>>Since my last message the system appeared to be running nicely. But I
>>decided to really check and see if things were ok...
>>
>>1. run memtest86:
>> I ran this through only one iteration of the default
>> tests. (didn't have time to go through them all)
>> Result - 100% passed.
>
>
> Good. Hopefully we can rule out memory issues.
>
>
>>2. Boot into single user, fsck everything. Only the /usr filesystem
>> appeared to have any serious issues. There were some "invalid
>> file type" (I think) errors and others. Anyway, it fixed them
>> and marked the fs clean.
>>
>>3. Continue to multiuser...
>> Apache fails! Says "Redirect" is an invalid directive! Eek!
>> Spamd fails, dumps core. Ouch!
>
>
> Bizarre.
>
>
>>4. Reboot to single user...
>> Here's where I ran your suggested adventure command...
>> raidctl -f /dev/wd1a raid0
>> fsck -fy /usr (corrects a few errors again)
>>
>>5. Continue to multiuser. All is well! Apache and spamd start
>> beautifully.
>
>
> Hmmmm.
>
>
>>So at this point, I'm fairly satisfied that something's going on with
>>RAIDframe.
>
>
> And/or its interaction with something. The one thing that still
> needs checking is the single vs. dual CPU thing. I'd be quite
> interested to know what happens with just a single-CPU kernel on that
> box.
>
>
>>>After the 'raidctl -F component0', right?
>>
>>Right.
>>
>>
>>>Hmmmm.... When you did
>>>the reboot, did you do it immediately after the above raidctl?
>>
>>I don't think so. It would normally be my procedure to do so, but I
>>think I remember letting the system come up multiuser, and everything
>>started up and ran for a while.
>>
>>
>>>Or did you do some other IO before rebooting?
>>
>>Probably.
>>
>>
>>>And did you use
>>>'shutdown -r' or something else?
>>
>>shutdown -r now
>>
>>
>>>Was the RAID set 'dirty' after the
>>>reboot?
>>
>>No. Everything was clean.
>>
>>
>>>Did fsck have to run?
>>
>>No. Filesystems were marked clean too.
>>
>>
>>>>I'll keep an eye on things. If I see more corruption tonight, I'll
>>>>attempt to "raidctl -f /dev/wd1a raid0" and see what happens.
>>>>Theoretically, if all hell breaks loose, I can revert to the other disk,
>>>>right?
>>>
>>>
>>>Yes. Another thing that could be done is to boot single-user, and
>>>see what 'fsck -f /dev/rraid0a' says when run a few times.
>>>
>>
>>Anything special about specifying "/dev/rraid0a" vs. "/" ? I used the
>>filesystem names.
>
>
> Well... what does /etc/fstab say? If things "match up" in there,
> then no problem :) Can you also send me (privately, if you wish) a
> copy of all the disklabels...
>
That's kind of the question. fstab has /dev/raid0a etc., not "rraid0a".
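For what it's worth, the relevant entries look something like this (I'm
guessing at the exact partition letters from memory; I'll double-check
the real file):

  /dev/raid0a  /      ffs  rw  1 1
  /dev/raid0e  /usr   ffs  rw  1 2

My understanding is that fsck looks a mount point up in fstab and uses
the corresponding raw device anyway, so "fsck -fy /usr" and running fsck
directly on the rraid0 device should end up doing the same thing.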
>
>>When running fsck a second time for /usr, it found more errors and fixed
>>them. A third fsck found nothing.
>
>
> Hmmm...
>
>
>>>I'm quite unable to replicate it :(
>>
>>Do you have a similar machine running RAID-1?
>
>
> I have 6 or 7 boxes w/ RAID 1 in production environments, but I don't
> think I have any with dual CPUs. (That said, I did do a bunch of
> testing w/ dual-CPU boxes quite some time ago, and didn't see any
> issues..)
>
>
>>What more can we look at to see what might be different about my machine
>>or its configuration?
>
>
> The number of CPUs seems to be something of interest right now.
>
Right. I'll try to schedule a bunch of tests, probably during the coming
weekend.
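If I understand the single-CPU test right, it should just be a matter of
booting a kernel without "options MULTIPROCESSOR" -- i.e. plain GENERIC
rather than GENERIC.MP. Assuming I have the procedure right (i386 paths
in my case), something like:

  $ cd /usr/src/sys/arch/i386/conf
  $ config GENERIC
  $ cd ../compile/GENERIC
  $ make depend && make
  # cp /netbsd /netbsd.mp ; cp netbsd /netbsd ; reboot

...and then repeat the fsck/corruption tests with only one CPU in use.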
> Oh... are your drives sharing an IDE cable?
>
No. Each drive is alone on its own bus.
>
>>Now I plan to let the machine run as it is now for at least a few days.
>>If no corruption shows up I'll be more convinced :) But I'd really like
>>to solve this problem, be it RAIDframe or something else.
>
>
> Verily.
>
> Some other ideas:
> 1) If you re-sync the RAID set, and then boot single-user (filesystems
> mounted read-only), what happens when you do:
>
> foreach i (1 2 3 4)
> dd if=/dev/rraid0d bs=1m count=1000 | md5
> end
>
> Do you see different md5 values? If you have time, take out the
> "count=1000" from above, and repeat.
>
> 2) With the RAID set synced, and booted into single-user (filesystems
> mounted read-only), what do you see when you do:
>
> foreach i (1 2 3 4)
> dd if=/dev/rwd0a bs=1m skip=1 count=1000 | md5
> dd if=/dev/rwd1a bs=1m skip=1 count=1000 | md5
> end
>
> Do you see different md5 values? You can increase the count to 10000
> or 100000 if you have time. (Don't increase it so far as to go past
> the portion of wd0a or wd1a that RAIDframe is using... there may be
> unused bits there that are not synced. Also, the "skip=1" is
> necessary to skip over the component labels (and more, in this case),
> since those are known to be different).
>
> And then, for more kicks, what does:
>
> foreach i (1 2 3 4)
> dd if=/dev/rwd0a bs=1m skip=1 count=1000 | md5 &
> dd if=/dev/rwd1a bs=1m skip=1 count=1000 | md5 &
> sleep 20
> end
>
> say? The md5 values should all agree, and should match those from
> the previous test.
>
I'll try to run these tests over the weekend (or sooner, if I find the
opportunity).
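One note: I'll probably run these from /bin/sh rather than csh, so
assuming I've translated the loops correctly, the equivalents would be
(I added 2>/dev/null so dd's record counts don't clutter the md5 output):

  # repeated reads of the raw RAID device -- md5s should all match
  for i in 1 2 3 4; do
    dd if=/dev/rraid0d bs=1m count=1000 2>/dev/null | md5
  done

  # each component past the label -- wd0a and wd1a md5s should agree
  for i in 1 2 3 4; do
    dd if=/dev/rwd0a bs=1m skip=1 count=1000 2>/dev/null | md5
    dd if=/dev/rwd1a bs=1m skip=1 count=1000 2>/dev/null | md5
  done

  # same again, but reading both components concurrently
  for i in 1 2 3 4; do
    dd if=/dev/rwd0a bs=1m skip=1 count=1000 2>/dev/null | md5 &
    dd if=/dev/rwd1a bs=1m skip=1 count=1000 2>/dev/null | md5 &
    sleep 20
  done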
Thanks very much...
Louis