Subject: Re: Possible serious bug in NetBSD-1.6.1_RC2
To: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: tech-kern
Date: 04/08/2003 22:58:31
On Tue, Apr 08, 2003 at 12:17:09AM -0700, Brian Buhrow wrote:
> Hello folks. I find it very curious that no one else is seeing the
> same problems I am with NetBSD-1.6X. I've narrowed things down to 2
> problems:
>
> 1. Periodically, usually during heavy i/o activity, the machine panics
> with a uvm_fault indicating an invalid page table.
>
> 2. The machine hangs with one or more processes in "flt_pmfail[12]".
>
> In response to Greg's discovery that paging to a raid5 swap area
> causes hangs, I changed my configuration to only swap and page to a single
> disk. This change does not change the behavior of my machine at all. Once
> it begins using the paging area on the disk, it won't be long until a hang
> occurs.
Hum, I have some systems which are paging a lot, and I don't have any problems
with them. I've had one crash during a crash test, but it had
run for 12 hours at a load average of about 300, with ram+swap 99.99% full
(new process creation was balanced by the system killing processes because
it was out of swap). I don't remember the details, but the crash was probably
related to one of the remaining problems with ram resource shortage (it was
a panic, not a uvm_fault). Not something to worry about in normal use.
>
> I've captured several core files from this hanging process, and would
> be happy to provide details for anyone who might be able to help shed light
> on the problem.
>
> I've captured core files which demonstrate both problems, and am
> willing to try and troubleshoot this problem further if anyone can provide
> guidance. I have 1.6.1 sources as of April 4, 2003 and I have a full
> symbol table copy of the kernel ready to trace with gdb, ps, vmstat, or
> whatever.
>
> Right now, the machine will not stay up more than 24 hours, and,
> usually, it crashes due to one of the problems within twelve hours of a
> restart.
> I'm usually pretty good at tracking down problems, but this one seems
> pretty thorny, and, I confess, I'm getting pretty frustrated. Would
> someone be willing to help me troubleshoot this problem further? I'm happy
> to provide any details, images, moral support, free beer, whatever. I've
> been using NetBSD for 10 years, and, in fact, this machine is supposed to
> be replacing my ancient NetBSD 0.9A system, but so far, that server is
> still more reliable than this shiny new 1.6.1 system.
Can you give more details about the hardware, and also about what the machine is
doing? Obviously it's not a bug which is triggered in common use.
Also, did you consider a hardware problem? Do you have another machine
to test with the same workload?
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 24 years of experience will always make the difference