netbsd-users: Re: kernel panics ...

Subject: Re: kernel panics ...
To: Steven M. Bellovin <smb@cs.columbia.edu>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: netbsd-users
Date: 01/19/2006 22:54:16

On Thu, Jan 19, 2006 at 10:39:15PM -0500, Steven M. Bellovin wrote:
> In message <20060120033019.GE828@mjch.net>, Malcolm Herbert writes:
>
> >|uvm_fault(0xcc191b6c, 0, 0, 1) -> 0xe
> >|kernel: page fault trap, code = 0
> >|Stopped in pid 6374, 1 (make) at netbsd:getcwd_common+0x5d:  movl  0(%edx),  %
[...]
> >Looking at that fragment, should I suspect memory or hard-drive? 
> 
> My money would be on memory, and it's easy to test -- get a copy of 
> memtest86+ from http://www.memtest.org/ and boot it.  Note: memtest86+ 
> appears to find more problems than pkgsrc/sysutils/memtest.

I would have bet the same way a few months ago.  But since then, I've
seen a number of brand-new machines, with high-quality name-brand
memory, do this -- with ECC and scrubbing, including cache scrubbing
and background scrubbing, turned on.  In my case, I have identical
systems with single-core and dual-core processors, and see it only on
the dual-cores (4-way multiprocessors), which is a bit unnerving.  On
the one hand, four Opteron cores clearly can put more demand on the memory
subsystem than two.  On the other hand, it also seems like with more cores,
there's considerably more room for some kind of synchronization bug to rear
its head.

Thor