Subject: Re: Possible serious bug in NetBSD-1.6.1_RC2
To: Greg Oster <oster@cs.usask.ca>
From: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
List: current-users
Date: 03/12/2003 02:58:49
	Hello Greg.  I tried a similar experiment here, and found the same
result.  Swapping to a raw disk partition, as opposed to a raid partition, works
fine, even if the rest of the filesystem i/o is going through the raid.
	This experience reminded me of a problem I had with raidframe
under 1.5R.
On that system, I could not run fsck -n on a mounted filesystem if the
filesystem sat on a raid5 device because doing so would cause uvm to go
into a tight loop claiming it was performing deadlock avoidance.  When the
machine died this evening while I was paging to the raid5 device, I
noticed that it died gradually, as if some resource had become
unavailable and then, as more and more things needed that resource,
everything coasted to a halt.
	In thinking about this some more, I'm pretty sure that the problem
isn't a lack of kernel memory, as we previously thought.  During the
entire run up to the crash, vmstat -m consistently showed only 3MB of
kernel memory in use.  I'm thinking that perhaps there is a problem with
the way physio() interacts with the locks you use to lock the individual
components of the raid5 set.  There's definitely a lot of calling around
inside the kernel while those locks are held, and I wonder if the i/o
pattern raid5 uses tends to exercise the locking algorithm more heavily
than the raid0 and raid1 i/o patterns do, or, alternatively, whether the
raid5 pattern causes uvm to lock itself up.
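	To make the shape of what I'm imagining a little more concrete,
here's a rough userland C sketch.  None of these names (stripe,
component_read, component_write, alloc_io_buffer) come from RAIDframe;
they're made up purely for illustration.  The point is just how long a
raid5 stripe stays locked on a small write, and how many buffer
allocations and component i/os happen while it's locked, compared with a
raid1 write:

/*
 * Hypothetical sketch of a RAID 5 small write vs. a RAID 1 write.
 * These are NOT RAIDframe routines; a pthread mutex stands in for a
 * per-stripe lock, and the I/O and allocation routines are stubs.
 */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct stripe {
	pthread_mutex_t lock;		/* stand-in for a per-stripe lock */
};

static void
component_read(int comp, void *buf, size_t len)
{
	(void)comp;
	memset(buf, 0, len);		/* pretend we read something */
}

static void
component_write(int comp, const void *buf, size_t len)
{
	(void)comp; (void)buf; (void)len;
}

static void *
alloc_io_buffer(size_t len)
{
	return malloc(len);		/* the real thing comes from a kernel pool */
}

static void
xor_into(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (len-- > 0)
		*d++ ^= *s++;
}

/*
 * RAID 5 small write ("read-modify-write"): read old data and old
 * parity, compute new parity, write new data and new parity -- four
 * component i/os and two temporary buffers, all while the stripe lock
 * is held.
 */
static void
raid5_small_write(struct stripe *sp, int data_comp, int parity_comp,
    const void *newdata, size_t len)
{
	void *olddata, *parity;

	pthread_mutex_lock(&sp->lock);

	olddata = alloc_io_buffer(len);
	parity = alloc_io_buffer(len);
	if (olddata == NULL || parity == NULL)
		abort();		/* a real driver would wait or fail the i/o */

	component_read(data_comp, olddata, len);	/* old data */
	component_read(parity_comp, parity, len);	/* old parity */

	xor_into(parity, olddata, len);		/* remove old data's contribution */
	xor_into(parity, newdata, len);		/* add new data's contribution */

	component_write(data_comp, newdata, len);	/* new data */
	component_write(parity_comp, parity, len);	/* new parity */

	free(olddata);
	free(parity);

	pthread_mutex_unlock(&sp->lock);
}

/*
 * RAID 1 write for contrast: two component writes, no reads, no
 * temporary buffers, so the lock is held much more briefly.
 */
static void
raid1_write(struct stripe *sp, const void *newdata, size_t len)
{
	pthread_mutex_lock(&sp->lock);
	component_write(0, newdata, len);
	component_write(1, newdata, len);
	pthread_mutex_unlock(&sp->lock);
}

int
main(void)
{
	struct stripe s;
	char block[512] = { 0 };

	pthread_mutex_init(&s.lock, NULL);
	raid5_small_write(&s, 0, 4, block, sizeof(block));
	raid1_write(&s, block, sizeof(block));
	pthread_mutex_destroy(&s.lock);
	return 0;
}

	The real code is obviously far more elaborate, but if something
like that raid5 path has to wait for memory, or for uvm, while it holds
its lock, and the requests it's serving are themselves coming from the
pager, I can see how things might wedge the way they did here.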
	Note that I'm only speculating here, based on relatively little
knowledge of how the internals of raidframe and uvm work; I'm just
trying to offer some hopefully reasonable suggestions about where to
look.
	Good luck with the testing.  If there's anything I can help with to
further the process of squashing this little annoyance, let me know.  I
love raidframe, and I'd love to see it work completely as one would expect.
-Brian

On Mar 11, 10:57pm, Greg Oster wrote:
} Subject: Re: Possible serious bug in NetBSD-1.6.1_RC2
} Brian Buhrow writes:
} > 2.  I'm not certain, but my guess is that the reason raid5 works under 1.5R
} > and not 1.6 is not so much due to the changes in raidframe itself, but,
} > rather, changes in the way kernel memory is managed by uvm.  They may be
} > only tuning changes, but something definitely changed.
} 
} So whatever the problem is, I can reliably lockup my test box in seconds
} when swapping to a RAID 5 set... Using the same test I couldn't kill my 
} test box when swapping to a RAID 1 set.  
} 
} Workaround to the problem: Swap to a RAID 1 set instead of a RAID 5 set.  
} While reads might not be as fast, writes will be (likely) 4x faster, which 
} should improve your swapping/paging performance too!
} 
} Figuring out what the problem is will require some additional testing,
} and that will have to wait until at least tomorrow... :-}
} 
} Thanks to everyone who helped narrow down the problem...
} 
} Later...
} 
} Greg Oster
} 
>-- End of excerpt from Greg Oster