Subject: Re: Possible serious bug in NetBSD-1.6.1_RC2
To: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: tech-kern
Date: 04/08/2003 22:58:31
On Tue, Apr 08, 2003 at 12:17:09AM -0700, Brian Buhrow wrote:
> 	Hello folks.  I find it very curious that no one else is seeing the
> same problems I am with NetBSD-1.6X.  I've narrowed things down to 2
> problems:
> 
> 1.  Periodically, usually during heavy i/o activity, the machine panics
> with a uvm_fault indicating an invalid page table.  
> 
> 2.  The machine hangs with one or more processes in "flt_pmfail[12]".
> 
> 	In response to Greg's discovery that paging to a raid5 swap area
> causes hangs, I changed my configuration to only swap and page to a single
> disk.  This change does not change the behavior of my machine at all.  Once
> it begins using the paging area on the disk, it won't be long until a hang
> occurs.

Hum, I have some systems which are paging a lot, and I don't have any problems
with them. I've had one crash during a crash test, but only after it had
run for 12 hours at a load average of about 300, with ram+swap 99.99% full
(new process creation was balanced by the system killing processes because
it was out of swap). I don't remember the details, but the crash was probably
related to one of the remaining problems with RAM resource shortage (it was
a panic, not a uvm_fault). Not something to worry about in normal use.

> 
> 	I've captured several core files from this hanging process, and would
> be happy to provide details for anyone who might be able to help shed light
> on the problem.
> 
> 	I've captured core files which demonstrate both problems, and am
> willing to try and troubleshoot this problem further if anyone can provide
> guidance.  I have 1.6.1 sources as of April 4, 2003 and I have a full
> symbol table copy of the kernel ready to trace with gdb, ps, vmstat, or
> whatever.
> 
> 	Right now, the machine will not stay up more than 24 hours, and,
> usually, it crashes due to one of the problems within twelve hours of a
> restart.  
> 	I'm usually pretty good at tracking down problems, but this one seems
> pretty thorny, and, I confess, I'm getting pretty frustrated.  Would
> someone be willing to help me troubleshoot this problem further?  I'm happy
> to provide any details, images, moral support, free beer, whatever.  I've
> been using NetBSD for 10 years, and, in fact, this machine is supposed to
> be replacing my ancient NetBSD 0.9A system, but so far, that server is
> still more reliable than this shiny new 1.6.1 system.

Can you give more details about the hardware, and also about what the machine is
doing? Obviously it's not a bug which is triggered in common use.
Also, did you consider a hardware problem? Do you have another machine
to test with the same workload?

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 24 years of experience will always make the difference
--