Subject: Re: kern/18636: Multiple uvm_pagefaults
To: Manuel Bouyer <bouyer@antioche.eu.org>
From: Don Phillips <don@resun.com>
List: netbsd-bugs
Date: 10/13/2002 11:31:46
>>>>> "Manuel" == Manuel Bouyer <bouyer@antioche.eu.org> writes:

[...]

Manuel> Well, this really looks like a hardware problem. 

'Twas my initial thought, however:

Memtest86 was run in extended mode.

I spent a month tracking the problem down to a SW subsystem.  I
replaced all of the memory.  I replaced the MB and processor.  I
reproduced it in both environments.  MBs were from two different
manufacturers.

Yep.  I, too, thought it was HW.  'I are a SW engineer.'  I'd say
that unless we've managed to find a flaw in two different
motherboards, running two different processors (AMD K-6, Athlon
1.53GHz), utilizing new memory modules, with a new HD, it would seem
to pretty solidly point to something in SW, and since the kernel
crashes, I'd say that the kernel, at a minimum, owns a piece of the
problem.

The only HW in common between the two systems was the network cards.
Yesterday, on 1.6, I made a kernel with all network cards and the
MII/PHYs disabled (but left the cards in place).

Manuel> I have various i386 systems running 1.6, all of them have
Manuel> been stable.

Wouldn't surprise me.  I sincerely believe that the release was well
tested before its release.  :-)

Manuel> I've even been pushing a system hard this week-end (to test
Manuel> a machine rebuilt from various pieces of hardware) running
Manuel> make -j20 or make -j40 (depending on the amount of RAM)
Manuel> kernel builds, and a build.sh -j10.  I've tested various RAM
Manuel> config from 32M to 128M.

Ah!  Now, we've got a difference that may lead to something.  The
configs that are breaking are 384MB (128MB+256MB) and 512MB.  And
the 1.5.2 failures, I believe, are after I upgraded from 256M to
384M.

And it would explain why I'm seeing the problem, but nobody else.  I
have *lots* of memory.

So, maybe a boundary condition, somewhere in the uvm system, for
large memory systems?  It doesn't always break.  1.5.2 is stable,
unless I'm running the SW system that uses DBs with tables of
1.2GB.

I've also been in touch with Chuck Silvers.  I've put various crash
dumps (and associated gdb kernels) up at ftp.netbsd.org in his incoming
directory.

Regards,
-- 
  Don Phillips         don@resun.com
  Escondido, Calif.    My opinions are just that, and no more.