Subject: Re: Bootable on Tyan Thunder, but crashes during a kernel compile
To: Christos Zoulas <christos@zoulas.com>
From: Frank van der Linden <fvdl@vaasje.org>
List: tech-smp
Date: 11/09/2001 16:28:48
On Fri, Nov 09, 2001 at 10:03:04AM -0500, Christos Zoulas wrote:
> uvm_fault(0xc05dfee0, 0xc1976000, 0, 3) = e
> pid 395 ls in
> pmap_alloc_pvpage+0x23
> pmap_enter+0x59d
> uvm_fault+0x7e3
> trap+0x62e
> trap 6
> 
> It seems that the pmap gets corrupted under heavy activity.

Ok, that's exactly what I am seeing. At least it's not a problem
specific to my board, although we're talking about the same
type of board here.

But, I'm not seeing ghosts, so that's always good to know :)

This is what I've found out so far about this problem:

* Under heavy pagein activity, some data structures are overwritten
  with garbage/data meant for something else
* The most likely candidates to be trashed are 1) the pv pages,
  2) amap structures. I've not seen it hit anything else.
* This usually leads to a panic through uvm_fault, as you are seeing;
  sometimes it is postponed until the amaps are wiped out.
* I only see it when starting X (which obviously leads to a burst
  of pagein activity).
* It does not seem to be a TLB consistency problem as I thought at
  one point. There is no missed IPI or anything, the TLB shootdown
  queues are empty when the problem occurs.
* Using debug malloc to see if anything is writing beyond the end of
  the data structures has not given me any results so far (the panic
  still happened, but wasn't caught by the malloc debugging code).
  I've tried it on all relevant types of memory.
* There doesn't seem to be a race condition for the pvlist structures;
  pvalloc_lock protects these, and I added a bunch of KASSERT()s to
  catch any path that touches them without holding the lock, but none
  of them fired.
* I tried allocating an empty page before or after each pvpage to catch
  anything writing into them, but this didn't catch it either (which
  is strange).

- Frank