Subject: Re: Possible serious bug in NetBSD-1.6.1_RC2
To: Greg Oster <oster@cs.usask.ca>
From: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
List: current-users
Date: 03/11/2003 10:04:27
	Hello Greg.  If I understand your message correctly, then I have a
couple of questions and observations.

1.  According to sysctl, nkmempages is already at 8102.  This is about 33MB
of memory, if my calculations are correct.  Using the value 8192 would be
about 33.5MB of memory, not much more than is currently in use.  Is there a
limit to the number of pages I can allocate?  Must it be a power of 2?  In
case it helps with the sizing, right now, under normal operation, the
machine lasts 28-36 hours before it hangs or panics.  If I perform the
exercise I listed in the previous e-mail, it hangs immediately.
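
(For reference, the arithmetic I'm using, assuming the usual 4KB page size
on this machine, is:

	8102 pages * 4096 bytes/page ~= 33.2MB
	8192 pages * 4096 bytes/page ~= 33.6MB

so NKMEMPAGES=8192 would only be about 90 pages more than the 8102 that
sysctl reports now.)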

2.  I'm not certain, but my guess is that the reason raid5 works under 1.5R
and not 1.6 is not so much due to changes in raidframe itself, but rather
to changes in the way kernel memory is managed by uvm.  They may be only
tuning changes, but something definitely changed.
-Brian
On Mar 11,  8:06am, Greg Oster wrote:
} Subject: Re: Possible serious bug in NetBSD-1.6.1_RC2
} Paul Ripke writes:
} > [ Greg, I've CC'ed you in on this, hope you don't mind, 
} 
} No problem...
} 
} > just in case you've
} > missed this thread, here's some more info. And, IMNSHO, an almost 100%
} > reproducible test case, if my understanding is anywhere close to the 
} > mark. ]
} > 
} > On Tuesday, Mar 11, 2003, at 21:18 Australia/Sydney, Brian Buhrow wrote:
} > 
} > > 	Hello Paul.  I believe it is the same bug you encountered.  I've a few
} > > observations about the bug, which I hope will be helpful in resolving it.
} > >
} > > 1.  It's definitely related to the use of the raid5 level device, 
} > > including raw partitions.
} 
} RAID 5 is the hardest on kernel memory... and from both reports, it sounds 
} like the kernel is running out of memory...
} 
} > > 2.  The bug exists in NetBSD 1.6, but not in 1.5R (early 2001 code).
} > 
} > Good to know - this will definitely help, I'm sure.
} 
} Interesting.  2001 was a "slow year" for RAIDframe development, so I'm not 
} sure why the problem wouldn't have been in 1.5R if it's in 1.6..
} 
} > > 3.  To reliably reproduce the hang:
} > > A.  Define a swap partition on your raid5 device.
} > >
} > > B.  Turn that partition on with swapctl (commands sketched below).
} > >
} > > C.  Watch the system go into the deep freezer when you try to link a
} > > kernel with debugging symbols turned on and swap is needed.
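} > >
} > > (For concreteness, step B looks something like the following, where
} > > raid0b is just a hypothetical name for whatever swap partition you
} > > defined on the RAID 5 set in step A:
} > >
} > > 	swapctl -a /dev/raid0b	# enable swapping on the RAID 5 partition
} > > 	swapctl -l		# confirm the partition shows up as active swap
} > >
} > > and for step C, a kernel config with 'makeoptions DEBUG="-g"' should
} > > give a link big enough to push the machine into swap.)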
} > 
} > Hmm... swap on RAID5... never thought of doing that.  Think I've only ever
} > seen it mirrored... nup, correction, have seen swap on hardware RAID5 on
} > Tru64 with internal PCI RAID controller.  OK, back to NetBSD, yes, I can
} > see how swap on RAIDframe RAID5 would exacerbate this problem!
} 
} There is a maximum amount of kernel memory that a RAID 5 set should use.
} (unless it has a leak, but I doubt that..)  The actual amount will depend on 
} stripe sizes, number of partial stripe writes, and stuff like that.
}  
} > > 4.  Softdep exacerbates the problem, but it's not softdep which is to
} > > blame here.
} > >
} > > 	Has Greg indicated whether or not he has any ideas on the matter?
} > 
} > I'm CC'ing Greg in, I have a hunch he understands the problem, but the
} > fix is more a design problem than bug squashing. Greg, correct me if I'm
} > wrong...
} 
} The RAID code (like some other kernel code) doesn't handle "no memory" 
} conditions very gracefully.  Making it handle "no memory" conditions 
} gracefully is on my list, but requires some Major Changes to RAIDframe code.
} 
} A couple of things to try:
} 1) Bump up the amount of memory your kernel has using:
} 
}  options NKMEMPAGES=8192
} 
} in your kernel config file.  (yes, that should give you *LOTS* of kernel 
} memory, but if the problem still happens with that much, then it really is 
} more than just an "out of kernel memory" problem.)
} 
} 2) Add the following option to your kernel config:
} 
}  options RAIDOUTSTANDING=3
} 
} This will limit the amount of IO going to each RAID set.  
} 
} If my guesses are correct, 1) will be more effective than 2), but a kernel 
} using 2) should run for quite a bit longer...
} 
} Later...
} 
} Greg Oster
} 
} 
>-- End of excerpt from Greg Oster