netbsd-users: Re: fs corruption with raidframe (again)

Subject: Re: fs corruption with raidframe (again)
To: Louis Guillaume <lguillaume@berklee.edu>
From: Greg Oster <oster@cs.usask.ca>
List: netbsd-users
Date: 01/12/2005 09:03:29
Louis Guillaume writes:
> Greg Oster wrote:
> > Louis Guillaume writes:
> > 
> >>Greg Oster wrote:
> >>
> >>>Louis Guillaume writes:
[snip]
> 
> Since my last message the system appeared to be running nicely. But I 
> decided to really check and see if things were ok...
> 
> 1. run memtest86:
>     I ran this through only one iteration of the default
>     tests. (didn't have time to go through all)
>     Result - 100% passed.

Good.  Hopefully we can ignore memory issues.

> 2. Boot into single user, fsck everything. Only the /usr filesystem
>     appeared to have any serious issues. There were some "invalid
>     file type" (I think) errors and others. Anyway it fixed them
>     and marked the fs clean.
> 
> 3. Continue to multiuser...
>     Apache fails! Says "Redirect" is an invalid directive! Eek!
>     Spamd fails, dumps core. Ouch!

Bizarre.
 
> 4. Reboot to single user...
>     Here's where I ran your suggested adventure command...
>     raidctl -f /dev/wd1a raid0
>     fsck -fy /usr (corrects a few errors again)
> 
> 5. Continue to multiuser. All is well! Apache and spamd start
>     beautifully.

Hmmmm.

> So at this point, I'm fairly satisfied that something's going on with 
> RaidFrame.

And/or its interaction with something.  The one thing that still 
needs checking is the single vs. dual CPU thing.  I'd be quite 
interested to know what happens with just a single-CPU kernel on that 
box.
 
> > 
> > After the 'raidctl -F component0', right? 
> Right.
> 
> > Hmmmm....  When you did 
> > the reboot, did you do it immediately after the above raidctl?  
> 
> I don't think so. It would normally be my procedure to do so, but I 
> think I remember allowing the system to come up multiuser and everything 
> started up and ran for a while.
> 
> > Or did you do some other IO before rebooting?
> 
> Probably.
> 
> > And did you use 
> > 'shutdown -r' or something else?
> 
> shutdown -r now
> 
> > Was the RAID set 'dirty' after the 
> > reboot?
> 
> No. Everything was clean.
> 
> > Did fsck have to run?  
> 
> No. filesystems were marked clean too.
> 
> >>I'll keep an eye on things. If I see more corruption tonight, I'll 
> >>attempt to "raidctl -f /dev/wd1a raid0" and see what happens. 
> >>Theoretically, if all hell breaks loose, I can revert to the other disk, 
> >>right?
> > 
> > 
> > Yes.  Another thing that could be done is to boot single-user, and 
> > see what 'fsck -f /dev/rraid0a' says when run a few times.  
> > 
> 
> Anything special about specifying "/dev/rraid0a" vs. "/" ? I used the 
> filesystem names.

Well...  what does /etc/fstab say?  If things "match up" in there, 
then no problem :)  Can you also send me (privately, if you wish) a 
copy of all the disklabels...

> When running fsck a second time for /usr, it found more errors and fixed 
> them. A third fsck found nothing.

Hmmm... 

> > 
> > I'm quite unable to replicate it :(
> 
> Do you have a similar machine running RAID-1?

I have 6 or 7 boxes w/ RAID 1 in production environments, but I don't 
think I have any with dual CPUs.  (That said, I did do a bunch of 
testing w/ dual-CPU boxes quite some time ago, and didn't see any 
issues.)

> What more can we look at to see what might be different about my machine 
> or its configuration?

The number of CPUs seems to be something of interest right now.

Oh... are your drives sharing an IDE cable?  

> Now I plan to let the machine run as it is now for at least a few days. 
> If no corruption shows up I'll be more convinced :) But I'd really like 
> to solve this problem, be it RaidFrame or something else.

Verily. 

Some other ideas:
1) If you re-sync the RAID set, and then boot single-user (filesystems 
mounted read-only), what happens when you do:

  foreach i (1 2 3 4)
   dd if=/dev/rraid0d bs=1m count=1000 | md5
  end

Do you see different md5 values?  If you have time, take out the 
"count=1000" from above, and repeat.

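For anyone who would rather script check 1 than compare the hashes by 
eye, here is a minimal sketch in POSIX sh (not the csh used above).  
The function name repeat_read_check is hypothetical, cksum stands in 
for md5 so the sketch does not depend on which hash tool is installed, 
and bs=1048576 is spelled out because not every dd accepts "1m".  It 
reads the same region several times and fails if any pass differs from 
the first:

```sh
#!/bin/sh
# repeat_read_check DEV COUNT PASSES
# Hypothetical helper (not part of raidctl or RAIDframe).
# Reads the first COUNT 1 MiB blocks of DEV, PASSES times over, and
# returns non-zero as soon as one pass checksums differently from pass 1.
repeat_read_check() {
    dev=$1; count=$2; passes=$3
    first=""
    i=1
    while [ "$i" -le "$passes" ]; do
        # cksum is used here instead of md5 purely for portability.
        sum=$(dd if="$dev" bs=1048576 count="$count" 2>/dev/null | cksum)
        echo "pass $i: $sum"
        if [ -z "$first" ]; then
            first=$sum
        elif [ "$sum" != "$first" ]; then
            echo "pass $i differs from pass 1" >&2
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}
```

Something like "repeat_read_check /dev/rraid0d 1000 4" would then 
correspond to the loop above.  It only reads from the device.
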
2) With the RAID set synced and the box booted into single-user 
(filesystems mounted read-only), what do you see when you do:

  foreach i (1 2 3 4)
   dd if=/dev/rwd0a bs=1m skip=1 count=1000 | md5
   dd if=/dev/rwd1a bs=1m skip=1 count=1000 | md5
  end

Do you see different md5 values?  You can increase the count to 10000 
or 100000 if you have time.  (Don't increase it so far that it goes 
past the portion of wd0a or wd1a that RAIDframe is using; there may 
be unused bits there that are not synced.  Also, the "skip=1" is 
necessary to skip over the component labels (and more, in this case), 
since those are known to be different.)
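
The component comparison can also be scripted so that a mismatch shows 
up as an exit status.  What follows is a rough sketch in POSIX sh, not 
anything RAIDframe provides: region_sum and compare_regions are 
made-up names, cksum is substituted for md5 for portability, and the 
block size is given in bytes since "1m" is not understood by every dd.

```sh
#!/bin/sh
# region_sum FILE SKIP COUNT
# Hypothetical helper: checksum COUNT 1 MiB blocks of FILE, starting
# after the first SKIP blocks (which is where the component label lives).
region_sum() {
    dd if="$1" bs=1048576 skip="$2" count="$3" 2>/dev/null | cksum
}

# compare_regions A B SKIP COUNT
# Hypothetical helper: succeed only if the same region of A and B
# produces the same checksum.
compare_regions() {
    a=$(region_sum "$1" "$3" "$4")
    b=$(region_sum "$2" "$3" "$4")
    if [ "$a" = "$b" ]; then
        echo "match: $a"
        return 0
    fi
    echo "MISMATCH: $1 -> $a" >&2
    echo "          $2 -> $b" >&2
    return 1
}
```

With that, "compare_regions /dev/rwd0a /dev/rwd1a 1 1000" would be the 
equivalent of one trip through the loop above.  The same caveat about 
not reading past the portion RAIDframe uses still applies to COUNT.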

And then, for more kicks, what does:

  foreach i (1 2 3 4)
   dd if=/dev/rwd0a bs=1m skip=1 count=1000 | md5 & 
   dd if=/dev/rwd1a bs=1m skip=1 count=1000 | md5 & 
   sleep 20
  end

say?  The md5 values should all agree, and should be the same as 
the previous test.
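
If you want the concurrent case to report agreement by itself, one 
possible sketch in POSIX sh is below.  The name concurrent_compare is 
hypothetical, cksum again stands in for md5, and an explicit "wait" 
takes the place of the "sleep 20" so that both reads are known to have 
finished before the results are compared.

```sh
#!/bin/sh
# concurrent_compare A B SKIP COUNT
# Hypothetical helper: read the same region of A and B at the same
# time (two background dd pipelines) and succeed only if the two
# checksums agree.
concurrent_compare() {
    out=$(mktemp -d) || return 1
    dd if="$1" bs=1048576 skip="$3" count="$4" 2>/dev/null | cksum > "$out/a" &
    dd if="$2" bs=1048576 skip="$3" count="$4" 2>/dev/null | cksum > "$out/b" &
    wait    # block until both background reads are done
    if cmp -s "$out/a" "$out/b"; then
        rc=0
        echo "match: $(cat "$out/a")"
    else
        rc=1
        echo "MISMATCH: $(cat "$out/a") vs $(cat "$out/b")" >&2
    fi
    rm -rf "$out"
    return $rc
}
```

"concurrent_compare /dev/rwd0a /dev/rwd1a 1 1000" would be the rough 
equivalent of one iteration.  The temporary directory needs somewhere 
writable, so in single-user with read-only filesystems you may have to 
point TMPDIR at a memory filesystem first.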

Later...

Greg Oster