netbsd-bugs: port-i386/33488: kernel fault in pmap

Subject: port-i386/33488: kernel fault in pmap_enter
To: None <port-i386-maintainer@netbsd.org, gnats-admin@netbsd.org,>
From: None <jld@panix.com>
List: netbsd-bugs
Date: 05/16/2006 02:40:00
>Number:         33488
>Category:       port-i386
>Synopsis:       kernel fault in pmap_enter (on xchg insn)
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    port-i386-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue May 16 02:40:00 +0000 2006
>Originator:     Jed Davis
>Release:        NetBSD 3.0
>Organization:
PANIX Public Access Internet and UNIX, NYC
>Environment:
System: NetBSD www2.panix.com 3.0 NetBSD 3.0 (PANIX-WEB) #6: Thu May  4 17:37:30 EDT 2006  root@juggler.panix.com:/devel/netbsd/3.0/src/sys/arch/i386/compile/PANIX-WEB i386
Architecture: i386
Machine: i386
>Description:

This panic occurs every so often on heavily used webservers and at least
one mail server, thusly:

  uvm_fault(0xd61961c4, 0xbfc00000, 0, 2) -> 0xe
  fatal page fault in supervisor mode
  trap type 6 code 2 eip c028f00d cs 8 eflags 10282 cr2 bfc00100 ilevel 0
  panic: trap
  Begin traceback...
  trap() at netbsd:trap+0x137
  --- trap (number 6) ---
  pmap_enter(d542473c,40000,29d98000,5,20) at netbsd:pmap_enter+0x1d9
  uvm_fault(d61961c4,41000,0,1,d6b93f5c) at netbsd:uvm_fault+0xbf9
  trap() at netbsd:trap+0x2ff
  --- trap (number 6) ---
  0x41e08:
  End traceback...

  uvm_fault(0xd6db8e08, 0xbfc28000, 0, 2) -> 0xe
  fatal page fault in supervisor mode
  trap type 6 code 2 eip c028f035 cs 8 eflags 10282 cr2 bfc28a90 ilevel 0
  panic: trap
  Begin traceback...
  trap() at netbsd:trap+0x137
  --- trap (number 6) ---
  pmap_enter(d0fbfbb0,a2a4000,161b9000,1,20) at netbsd:pmap_enter+0x1d9
  uvm_fault(d6db8e08,a2a1000,0,1,0) at netbsd:uvm_fault+0xd31
  trap() at netbsd:trap+0x2ff
  --- trap (number 6) ---
  0x80cf5f0:
  End traceback...

  uvm_fault(0xceaa8a80, 0xbfed6000, 0, 2) -> 0xe
  fatal page fault in supervisor mode
  trap type 6 code 2 eip c028f035 cs 8 eflags 10282 cr2 bfed6294 ilevel 0
  panic: trap
  Begin traceback...
  trap() at netbsd:trap+0x137
  --- trap (number 6) ---
  pmap_enter(ceaa9390,b58a5000,207c8000,7,20) at netbsd:pmap_enter+0x1d9
  uvm_fault(ceaa8a80,b58a2000,0,1,d0615fa8) at netbsd:uvm_fault+0xd31
  trap() at netbsd:trap+0x2ff
  --- trap (number 6) ---
  0xb956834e:
  End traceback...

  uvm_fault(0xd72aca90, 0xbfc26000, 0, 2) -> 0xe
  db> bt
  pmap_enter(d5368480,9ba7000,1b2e1000,1,21) at netbsd:pmap_enter+0x1d9
  uvm_fault(d72aca90,9ba7000,0,1,c0114d77) at netbsd:uvm_fault+0x458
  trap() at netbsd:trap+0x2ff
  --- trap (number 6) ---
  0x80cf5f0:
  db>

The kernel has no debug info, but assuming the trace is reporting the
correct address, the faulting instruction (marked with **) is, in context:

  0xc028f02a <pmap_enter+462>:    call   0xc028cf1c <pvtree_SPLAY_INSERT>
  0xc028f02f <pmap_enter+467>:    add    $0x10,%esp
  0xc028f032 <pmap_enter+470>:    mov    0xffffffb4(%ebp),%ecx
**0xc028f035 <pmap_enter+473>:    xchg   %edi,(%ecx)
  0xc028f037 <pmap_enter+475>:    mov    %edi,0xffffffb8(%ebp)
  0xc028f03a <pmap_enter+478>:    andl   $0x21,0xffffffb8(%ebp)
  0xc028f03e <pmap_enter+482>:    cmpl   $0x21,0xffffffb8(%ebp)
  0xc028f042 <pmap_enter+486>:    je     0xc028f04e <pmap_enter+498>

This corresponds to the following line of pmap.c, r1.XXX:

   3467         opte = x86_atomic_testset_ul(ptep, npte);   /* zap! */
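
For readers unfamiliar with the idiom: on x86, an xchg with a memory
operand is implicitly locked, so it stores the new value and returns the
old one in one atomic step. A minimal sketch of that behaviour in portable
C11 follows; pte_testset is a hypothetical stand-in for illustration, not
the NetBSD definition of x86_atomic_testset_ul.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for x86_atomic_testset_ul(): write the new PTE
 * and hand back the previous one in a single atomic exchange.
 */
static uint32_t
pte_testset(_Atomic uint32_t *ptep, uint32_t npte)
{
	/* On x86 a compiler would typically emit xchg for this exchange. */
	return atomic_exchange(ptep, npte);
}
```

The exchange is a write to *ptep, which is consistent with the traces
reporting the fault on the xchg itself.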

>How-To-Repeat:

If we wait long enough, it'll show up.  It's been found on at least
three different hosts, so it is almost certainly not a hardware problem.

All affected hosts are acting as NFS clients.

We have core files.

>Fix:

I have none, but I've investigated a bit:

The fault addresses all check out -- the second fault's address lies
within the PTP, in the main recursive page table mapping, that corresponds
to the first fault's address.  Since it was the main mapping and the
pmap was not the kernel's, we know (from inspection of pmap_map_ptes)
that the pmap was current at the time of the fault; and yet, in the
core file, the entire user side of the current PDP is zeroed.  Also, the
address in "ptep" has to be valid when that variable is assigned, as
it's dereferenced immediately afterwards to get the old PTE; yet somehow
the mapping gets yanked out from under it by the time of the xchg?
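
The address check can be reproduced with a little arithmetic. The sketch
below assumes a non-PAE i386 layout (4 KB pages, 4-byte PTEs) and a
recursive PTE window starting at 0xbfc00000; that base is inferred from
the traces rather than read out of this kernel's headers, and the names
pte_addr, ptp_base and check_traces are made up for illustration. It
takes the va argument of pmap_enter from each trace and compares the
result with the reported cr2 and with the address passed to the inner
uvm_fault.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants; PTE_BASE is inferred from the traces. */
#define PTE_BASE   0xbfc00000u  /* start of the recursive PTE window */
#define PAGE_SHIFT 12           /* 4 KB pages */
#define PTE_SIZE   4u           /* one 32-bit PTE per page */

/* Address of the PTE mapping va, as seen through the recursive mapping. */
static uint32_t
pte_addr(uint32_t va)
{
	return PTE_BASE + (va >> PAGE_SHIFT) * PTE_SIZE;
}

/* Base of the page-table page (PTP) that holds that PTE. */
static uint32_t
ptp_base(uint32_t va)
{
	return pte_addr(va) & ~0xfffu;
}

/* Check the first three traces: pmap_enter va against cr2 and PTP base. */
static void
check_traces(void)
{
	assert(pte_addr(0x00040000u) == 0xbfc00100u);
	assert(ptp_base(0x00040000u) == 0xbfc00000u);
	assert(pte_addr(0x0a2a4000u) == 0xbfc28a90u);
	assert(ptp_base(0x0a2a4000u) == 0xbfc28000u);
	assert(pte_addr(0xb58a5000u) == 0xbfed6294u);
	assert(ptp_base(0xb58a5000u) == 0xbfed6000u);
}
```

Calling check_traces() from a small test program passes under these
assumptions, and the fourth trace (va 9ba7000) likewise gives a PTP base
of 0xbfc26000, matching its uvm_fault line.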

And that's as far as I've gotten.