Subject: Re: fs corruption with raidframe (again)
To: Greg Oster <oster@cs.usask.ca>
From: Louis Guillaume <lguillaume@berklee.edu>
List: netbsd-users
Date: 01/13/2005 00:23:23
Greg Oster wrote:
> Louis Guillaume writes:
>
>>Greg Oster wrote:
>>
>>>Louis Guillaume writes:
>>>
>>>
>>>>Greg Oster wrote:
>>>>
>>>>
>>>>>Louis Guillaume writes:
>
> [snip]
>
>>Since my last message the system appeared to be running nicely. But I
>>decided to really check and see if things were ok...
>>
>>1. run memtest86:
>> I ran this through only one iteration of the default
>> tests. (didn't have time to go through them all)
>> Result - 100% passed.
>
>
> Good. Hopefully we can rule out memory issues.
>
>
>>2. Boot into single user, fsck everything. Only the /usr filesystem
>> appeared to have any serious issues. There were some "invalid
>> file type" (I think) errors and others. Anyway, it fixed them
>> and marked the fs clean.
>>
>>3. Continue to multiuser...
>> Apache fails! Says "Redirect" is an invalid directive! Eek!
>> Spamd fails, dumps core. Ouch!
>
>
> Bizarre.
>
>
>>4. Reboot to single user...
>> Here's where I ran your suggested adventure command...
>> raidctl -f /dev/wd1a raid0
>> fsck -fy /usr (corrects a few errors again)
>>
>>5. Continue to multiuser. All is well! Apache and spamd start
>> beautifully.
>
>
> Hmmmm.
>
>
>>So at this point, I'm fairly satisfied that something's going on with
>>RAIDframe.
>
>
> And/or its interaction with something. The one thing that still
> needs checking is the single vs. dual CPU thing. I'd be quite
> interested to know what happens with just a single-CPU kernel on that
> box.
>
>
>>>After the 'raidctl -F component0', right?
>>
>>Right.
>>
>>
>>>Hmmmm.... When you did
>>>the reboot, did you do it immediately after the above raidctl?
>>
>>I don't think so. It would normally be my procedure to do so, but I
>>think I remember letting the system come up multiuser, and everything
>>started up and ran for a while.
>>
>>
>>>Or did you do some other IO before rebooting?
>>
>>Probably.
>>
>>
>>>And did you use
>>>'shutdown -r' or something else?
>>
>>shutdown -r now
>>
>>
>>>Was the RAID set 'dirty' after the
>>>reboot?
>>
>>No. Everything was clean.
>>
>>
>>>Did fsck have to run?
>>
>>No. Filesystems were marked clean too.
>>
>>
>>>>I'll keep an eye on things. If I see more corruption tonight, I'll
>>>>attempt to "raidctl -f /dev/wd1a raid0" and see what happens.
>>>>Theoretically, if all hell breaks loose, I can revert to the other disk,
>>>>right?
>>>
>>>
>>>Yes. Another thing that could be done is to boot single-user, and
>>>see what 'fsck -f /dev/rraid0a' says when run a few times.
>>>
>>
>>Anything special about specifying "/dev/rraid0a" vs. "/" ? I used the
>>filesystem names.
>
>
> Well... what does /etc/fstab say? If things "match up" in there,
> then no problem :) Can you also send me (privately, if you wish) a
> copy of all the disklabels...
>
That's kind of the question. fstab has /dev/raid0a etc., not "rraid0a".
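For what it's worth, the relevant entries look something like this (I'm
guessing at the exact partition letters from memory; I'll double-check
the real file):

  /dev/raid0a  /      ffs  rw  1 1
  /dev/raid0e  /usr   ffs  rw  1 2

My understanding is that fsck looks a mount point up in fstab and uses
the corresponding raw device anyway, so "fsck -fy /usr" and running fsck
directly on the rraid0 device should end up doing the same thing.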
>
>>When running fsck a second time for /usr, it found more errors and fixed
>>them. A third fsck found nothing.
>
>
> Hmmm...
>
>
>>>I'm quite unable to replicate it :(
>>
>>Do you have a similar machine running RAID-1?
>
>
> I have 6 or 7 boxes w/ RAID 1 in production environments, but I don't
> think I have any with dual CPUs. (That said, I did do a bunch of
> testing w/ dual-CPU boxes quite some time ago, and didn't see any
> issues..)
>
>
>>What more can we look at to see what might be different about my machine
>>or its configuration?
>
>
> The number of CPUs seems to be something of interest right now.
>
Right. I'll try to schedule a bunch of tests, probably during the coming
weekend.
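If I understand the single-CPU test right, it should just be a matter of
booting a kernel without "options MULTIPROCESSOR" -- i.e. plain GENERIC
rather than GENERIC.MP. Assuming I have the procedure right (i386 paths
in my case), something like:

  $ cd /usr/src/sys/arch/i386/conf
  $ config GENERIC
  $ cd ../compile/GENERIC
  $ make depend && make
  # cp /netbsd /netbsd.mp ; cp netbsd /netbsd ; reboot

...and then repeat the fsck/corruption tests with only one CPU in use.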
> Oh... are your drives sharing an IDE cable?
>
No. Each drive is alone on its own bus.
>
>>Now I plan to let the machine run as it is now for at least a few days.
>>If no corruption shows up I'll be more convinced :) But I'd really like
>>to solve this problem, be it RAIDframe or something else.
>
>
> Verily.
>
> Some other ideas:
> 1) If you re-sync the RAID set, and then boot single-user (filesystems
> mounted read-only), what happens when you do:
>
> foreach i (1 2 3 4)
> dd if=/dev/rraid0d bs=1m count=1000 | md5
> end
>
> Do you see different md5 values? If you have time, take out the
> "count=1000" from above, and repeat.
>
> 2) With the RAID set synced, and booted into single-user (filesystems
> mounted read-only), what do you see when you do:
>
> foreach i (1 2 3 4)
> dd if=/dev/rwd0a bs=1m skip=1 count=1000 | md5
> dd if=/dev/rwd1a bs=1m skip=1 count=1000 | md5
> end
>
> Do you see different md5 values? You can increase the count to 10000
> or 100000 if you have time. (Don't increase it so far as to go past
> the portion of wd0a or wd1a that RAIDframe is using... there may be
> unused bits there that are not synced. Also, the "skip=1" is
> necessary to skip over the component labels (and more, in this case),
> since those are known to be different).
>
> And then, for more kicks, what does:
>
> foreach i (1 2 3 4)
> dd if=/dev/rwd0a bs=1m skip=1 count=1000 | md5 &
> dd if=/dev/rwd1a bs=1m skip=1 count=1000 | md5 &
> sleep 20
> end
>
> say? The md5 values should all agree, and should match those from
> the previous test.
>
I'll try to run these tests over the weekend (or sooner, if I find the
opportunity).
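One note: I'll probably run these from /bin/sh rather than csh, so
assuming I've translated the loops correctly, the equivalents would be
(I added 2>/dev/null so dd's record counts don't clutter the md5 output):

  # repeated reads of the raw RAID device -- md5s should all match
  for i in 1 2 3 4; do
    dd if=/dev/rraid0d bs=1m count=1000 2>/dev/null | md5
  done

  # each component past the label -- wd0a and wd1a md5s should agree
  for i in 1 2 3 4; do
    dd if=/dev/rwd0a bs=1m skip=1 count=1000 2>/dev/null | md5
    dd if=/dev/rwd1a bs=1m skip=1 count=1000 2>/dev/null | md5
  done

  # same again, but reading both components concurrently
  for i in 1 2 3 4; do
    dd if=/dev/rwd0a bs=1m skip=1 count=1000 2>/dev/null | md5 &
    dd if=/dev/rwd1a bs=1m skip=1 count=1000 2>/dev/null | md5 &
    sleep 20
  done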
Thanks very much...
Louis