Subject: Re: vm_fault in pmap_changebit()
To: Dave Huang <khym@bga.com>
From: Chuck Cranor <chuck@maria.wustl.edu>
List: port-i386
Date: 11/11/1997 12:03:08
>Well, my machine crashed again, and I got the following message:
>Bad PV entry discovered: pmap=0xf0676b80, va=0xefbfd000
>  NO page-table!
>vm_fault(0xf0694a00, 0, 1, 0) -> 5
>fatal page fault in supervisor mode
>trap type 6 code efbf0000 eip f01a8bc4 cs f01a0008 eflags 10286 cr2 0 cpl e00044c2
>panic: trap
>And as usual, the fault occurs at the "*pte = (*pte & maskbits) | setbits"
>line... pte is a null pointer.

>I've tried to follow the stack frames by hand... except for the
>pmap_changebit, these are the return addresses on the stack, so
>they're pointing to the line after the actual call. Hopefully, I did
>this correctly :)

>0xf01a8bc4 is in pmap_changebit (../../../../arch/i386/i386/pmap.c:1658).
>0xf019dad9 is in vm_page_deactivate (machine/pmap.h:177). [ actually the inline function pmap_clear_reference ]
>0xf019e0bc is in vm_pageout_scan (../../../../vm/vm_pageout.c:241).
>0xf019e3d9 is in vm_pageout (../../../../vm/vm_pageout.c:570).
>0xf01086a1 is in start_pagedaemon (../../../../kern/init_main.c:570).
>I lose the stack frames here... It looks like the next one is proc_trampoline, in locore.s

>So, does this mean anything to anyone? :) Kernel was compiled from
>November 9 source... core dump and kernel with debug symbols available
>if anyone wants to see.

basically, the pagedaemon is running because there is a memory
shortage.  it has already scanned the inactive queue (to free pages)
and is now moving pages from the active queue to the inactive queue
("vm_page_deactivate").  as part of that process it wants to clear
each page's "reference" bit.

to do this, the pmap must find every valid mapping of the page and
clear the reference bit in each PTE that maps it.  for each vm_page,
the pmap module keeps a list of <PMAP,VA> pairs (the "pv list"), one
for each active mapping of the page.
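
as a rough illustration (a toy model in C, not the actual pmap.c code), the pv list walk looks something like the sketch below.  all of the toy_* names and struct layouts are made up for this example.  the only things taken from the real world are the two-level i386 page table split, the x86 "accessed" bit (0x20), and the "*pte = (*pte & maskbits) | setbits" update quoted above.  unlike the code that crashed, this sketch checks for a missing page table page and counts the entry as bad instead of dereferencing a null pte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PG_V   0x01u  /* x86 "present" bit in a PTE */
#define TOY_PG_REF 0x20u  /* x86 "accessed" (reference) bit in a PTE */

/* hypothetical pmap: 1024 page directory slots, each pointing at a
 * page table of 1024 PTEs, or NULL if no page table page exists */
struct toy_pmap {
    uint32_t *ptp[1024];
};

/* hypothetical pv entry: one <PMAP,VA> pair per mapping of a page */
struct toy_pv {
    struct toy_pv   *next;
    struct toy_pmap *pmap;
    uint32_t         va;
};

/* find the PTE for va, or NULL if the 4MB block has no page table */
static uint32_t *toy_pte(struct toy_pmap *pm, uint32_t va)
{
    uint32_t *pt = pm->ptp[va >> 22];

    if (pt == NULL)
        return NULL;
    return &pt[(va >> 12) & 0x3ff];
}

/* walk the pv list, updating each PTE; returns the number of
 * entries that had no page table behind them (stale or broken) */
static int toy_changebit(struct toy_pv *pv, uint32_t setbits,
                         uint32_t maskbits)
{
    int bad = 0;

    for (; pv != NULL; pv = pv->next) {
        uint32_t *pte;

        assert(pv->pmap != NULL);
        pte = toy_pte(pv->pmap, pv->va);
        if (pte == NULL) {      /* the "NO page-table!" case */
            bad++;
            continue;
        }
        *pte = (*pte & maskbits) | setbits;
    }
    return bad;
}
```

in this model, clearing the reference bit on every mapping of a page is toy_changebit(pv, 0, ~TOY_PG_REF), and a non-zero return value means the list had at least one bad entry.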

so, pmap_changebit() is walking this list of <PMAP,VA> pairs,
clearing the reference bit, when it comes to the entry for
<PMAP=0xf0676b80, VA=0xefbfd000>.  VA 0xefbfd000 is the top of the
stack.  since that <PMAP,VA> pair is on the list, the pmap expects
the vm_page to be mapped at 0xefbfd000 in that pmap.  however, when
it looks at the page directory for pmap 0xf0676b80 it discovers that
not only is the vm_page not mapped at VA 0xefbfd000, but the pmap
doesn't even have a page table page mapping the 4MB block of VM
that 0xefbfd000 lives in!
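
to make the address arithmetic concrete, here is a small sketch assuming classic two-level i386 paging (4KB pages, 1024 PTEs per page table page, so each page directory slot covers 4MB).  the function names are invented for this example and are not the macros used in pmap.c.

```c
#include <assert.h>
#include <stdint.h>

/* page directory slot: top 10 bits of the VA (one slot per 4MB) */
static uint32_t toy_pdei(uint32_t va)
{
    return va >> 22;
}

/* PTE slot within the page table page: middle 10 bits of the VA */
static uint32_t toy_ptei(uint32_t va)
{
    return (va >> 12) & 0x3ff;
}

/* start of the 4MB block of VM that va lives in */
static uint32_t toy_block_base(uint32_t va)
{
    return va & ~((1u << 22) - 1);
}

/* rebuild a page-aligned VA from its two slot numbers */
static uint32_t toy_va_from(uint32_t pdei, uint32_t ptei)
{
    assert(pdei < 1024 && ptei < 1024);
    return (pdei << 22) | (ptei << 12);
}
```

for VA 0xefbfd000 this gives page directory slot 958 (0x3be) and PTE slot 1021 (0x3fd), and the 4MB block starts at 0xef800000.  under these assumptions, it is page directory slot 958 that has no page table page behind it.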


so, the problem is that either the page table page that was mapping
that address went away when it should not have, or the <PMAP,VA>
entry on the pv list is a stale one that shouldn't have been there
in the first place.  note that the page table pages are currently
part of the user's address space (from VM_MAXUSER_ADDRESS to
VM_MAX_ADDRESS) and there is an interesting hack in trap.c to
"pre-fault" them in.  this could be kind of painful to debug.



the somewhat good news is that i've re-written the i386 pmap from
scratch based on the current pmap, with various changes from myself,
Mach, and FreeBSD thrown in.  i've been using this pmap for a couple
of months without any problems, and it might solve your problem.
the somewhat bad news is that my pmap re-write is designed to fit
into my VM system rather than the Mach-based VM system currently in
the tree.

chuck