Subject: pmap layer inefficiencies, and possible solutions
To: tech-kern@netbsd.org
From: Charles M. Hannum <root@ihack.net>
List: tech-kern
Date: 12/07/1998 09:33:41
One of the more `interesting' aspects of the Mach VM system, inherited
by UVM, is the way page inactivation and swapping are done.  To
review:

* Pages are initially put on an `active list'.

* When the number of pages on the `inactive list' (that is, pages
  eligible for paging out) gets too small, we walk the active list
  looking for pages that have not been used (referenced or modified)
  since the last walk.

* When the number of pages on the free list gets too small, we walk
  the inactive list looking for pages that still have not been used,
  and schedule them for asynchronous pageout.
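
In rough C, the scanning described above looks something like the
following.  All of the names here are illustrative -- they are not
the actual UVM identifiers -- and the real code uses queue macros and
locking that are elided:

    struct page {
            struct page *pageq;             /* list linkage */
            int referenced;                 /* (possibly emulated) R bit */
            int modified;                   /* (possibly emulated) M bit */
    };

    extern struct page *active_head, *inactive_head;
    extern void deactivate(struct page *);  /* active -> inactive */
    extern void schedule_pageout(struct page *);

    /* Demote active pages that have not been used since the last walk. */
    void
    scan_active(void)
    {
            struct page *pg, *next;

            for (pg = active_head; pg != NULL; pg = next) {
                    next = pg->pageq;
                    if (pg->referenced || pg->modified)
                            pg->referenced = 0;     /* rearm for next walk */
                    else
                            deactivate(pg);
            }
    }

    /* Page out inactive pages that still have not been used. */
    void
    scan_inactive(void)
    {
            struct page *pg, *next;

            for (pg = inactive_head; pg != NULL; pg = next) {
                    next = pg->pageq;
                    if (!pg->referenced && !pg->modified)
                            schedule_pageout(pg);
            }
    }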

The place where we get severe inefficiencies is in walking the active
and inactive lists.  Because these high-level operations are done on
physical pages, the pmap module is forced to keep a list of virtual
mappings (the pv_list) for each physical page, and to walk that list
when the operations are done (a sketch of this structure follows the
list below).  There are some problems with this:

* Due to the way these mini lists are maintained, consecutive
  pv_entrys usually live on different pages of pv_entry storage, so
  just walking the list causes TLB thrashing.

* On processors that require emulation of referenced and modified
  bits, the clear_modified and clear_referenced operations cause
  additional TLB thrashing, and cause each process to take a fault the
  next time a page is touched.

  Some of these pmap modules have an optimization to stop
  invalidating PTEs as soon as one reference is found, but note that
  this has problems; the order of entries in the pv_list does not
  change over time, so if there are some long-lived processes, we
  will always check their virtual mappings for uses first.  The
  optimization also doesn't buy you anything when the page is
  actually being paged out, because a full walk and invalidate is
  still needed.

* On the i386 and ns32k, due to the method of accessing page tables,
  we do a TLB flush for each entry on the pv_list.  This is extremely
  inefficient.

* The pv_list maintenance will require interprocessor locks for SMP
  support.  This too is inefficient.

* To make matters worse, removing a mapping in a single pmap causes us
  to do a linear search of the pv_list to find the entry for that
  {pmap,va} pair.  This is expensive for heavily shared pages (such
  as shared libraries), particularly because of the long-lived
  process/pmap ordering issue noted above.
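
For concreteness, here is roughly the shape of the structures in
question, together with the linear search mentioned in the last item.
Field and function names are made up for illustration and do not
match any particular pmap module:

    typedef unsigned long vaddr_t;
    struct pmap;                            /* opaque here */

    struct pv_entry {
            struct pv_entry *pv_next;       /* next mapping of this page */
            struct pmap     *pv_pmap;       /* pmap owning the mapping */
            vaddr_t          pv_va;         /* VA of the mapping */
    };

    struct pv_head {                        /* one per managed phys page */
            struct pv_entry *pvh_list;
    };

    /*
     * The linear search pmap_remove() does to find the entry for one
     * {pmap,va} pair.  Each pv_entry touched may live on a different
     * page of pv_entry storage, hence the TLB thrashing.
     */
    struct pv_entry **
    pv_find(struct pv_head *pvh, struct pmap *pm, vaddr_t va)
    {
            struct pv_entry **pvp;

            for (pvp = &pvh->pvh_list; *pvp != NULL;
                pvp = &(*pvp)->pv_next)
                    if ((*pvp)->pv_pmap == pm && (*pvp)->pv_va == va)
                            break;
            return (pvp);   /* the matching link, or the final NULL */
    }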

Unfortunately, some of these problems are difficult -- if not
impossible -- to solve.  While it would be possible to get rid of
pmap_is_referenced() by using different data structures, we must have
pmap_is_modified() in order to support operations such as msync(2).
This requires being able to find all the virtual mappings for a page
efficiently, and therefore requires something like the pv_lists.

Also unfortunately, if we modify the pv_list at every point where a
mapping changes, one of pmap_remove() or pmap_{is,clear}_{mod,ref}*()
has to have lousy TLB usage.  (This is because one touches multiple
pv_entrys for the same pmap and different VAs, and the other touches
multiple pv_entrys for the same PA.  This is a classic row-vs-column
problem.)
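
To make the row-vs-column point concrete, here are the two access
patterns, continuing the sketch above; whichever way pv_entrys are
laid out in memory, one of these loops strides across many pages of
pv_entry storage:

    /* pmap_{is,clear}_{mod,ref}*(): all mappings of ONE physical page. */
    void
    walk_by_pa(struct pv_head *pvh)
    {
            struct pv_entry *pv;

            for (pv = pvh->pvh_list; pv != NULL; pv = pv->pv_next)
                    ;       /* entries came from unrelated pmap_enter()s */
    }

    /*
     * pmap_remove(): MANY physical pages, all mapped by one pmap.
     * (In reality each PA comes from the PTE being removed; indexing
     * pvhtab directly just keeps the sketch short.)
     */
    void
    walk_by_pmap(struct pv_head *pvhtab, struct pmap *pm, vaddr_t va,
        unsigned long npages)
    {
            unsigned long i;

            for (i = 0; i < npages; i++, va += 4096)
                    (void)pv_find(&pvhtab[i], pm, va);
    }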

But there are some potential improvements we can make:

* Use a move-toward-front algorithm to help minimize the distance we
  have to walk down a pv_list to find a referenced or modified entry,
  by bubbling idle entries toward the end and highly active entries
  toward the beginning.  (A sketch of this follows the list.)

* Make pmap_remove() only invalidate page table entries, but *not*
  modify the pv_list.  The actual garbage collection of pv_entrys can
  be done during the pv_list walks in pmap_{is,clear}_{mod,ref}*();
  since we have to touch the PTE anyway, we can check there whether
  it's still valid and, if not, GC that entry.  (A sketch of this
  also follows the list.)

  A trick is necessary here in case the VA is reused before a pv_list
  walk is done.  To handle that case, we can leave the old PA in the
  PTE (but with page permissions revoked), and, iff the new mapping
  has the same PA, just not create a new pv_entry in pmap_enter().

  We also need some way of GCing stale pv_entrys periodically if it's
  not done by normal page scanning.  One option here is to keep a
  global count of stale entries and kick off some GC process if that
  gets too large (similar to the [currently disabled] code to compact
  pv_pages).

  This eliminates the need to walk the pv_list in pmap_remove(), and
  the corresponding TLB thrashing and usage of the global page locks.

* Make some attempt to keep pv_entrys for the same physical page close
  together.  I haven't thought about this much, but power-of-2
  allocators generally help with things like this.  This would reduce
  TLB thrashing during pv_list walks.
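
Here is a sketch of the move-toward-front idea from the first item,
reusing the pv_list structures sketched earlier; pte_was_referenced()
is a made-up helper standing in for the machine-dependent check of
the (possibly emulated) R bit:

    extern int pte_was_referenced(struct pmap *, vaddr_t);

    /*
     * Is the page referenced?  Move the first hit to the front of
     * the list, so that active mappings cluster at the head and idle
     * ones drift toward the tail.
     */
    int
    pvh_is_referenced(struct pv_head *pvh)
    {
            struct pv_entry **pvp, *pv;

            for (pvp = &pvh->pvh_list; (pv = *pvp) != NULL;
                pvp = &pv->pv_next) {
                    if (pte_was_referenced(pv->pv_pmap, pv->pv_va)) {
                            *pvp = pv->pv_next;             /* unlink */
                            pv->pv_next = pvh->pvh_list;    /* to front */
                            pvh->pvh_list = pv;
                            return (1);
                    }
            }
            return (0);
    }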
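
And a sketch of the lazy garbage collection from the second item.
The staleness test below glosses over the same-PA reuse trick
described above; pte_lookup(), pte_pa(), pte_valid(), and pv_free()
are made-up helpers:

    typedef unsigned long paddr_t;
    typedef unsigned long pte_t;

    extern pte_t *pte_lookup(struct pmap *, vaddr_t);
    extern paddr_t pte_pa(pte_t);           /* PA the PTE points at */
    extern int pte_valid(pte_t);            /* any permissions left? */
    extern void pv_free(struct pv_entry *);

    /* Called from the pv_list walks; pa is the page this list is for. */
    void
    pvh_gc_walk(struct pv_head *pvh, paddr_t pa)
    {
            struct pv_entry **pvp, *pv;
            pte_t *pte;

            pvp = &pvh->pvh_list;
            while ((pv = *pvp) != NULL) {
                    pte = pte_lookup(pv->pv_pmap, pv->pv_va);
                    if (pte == NULL || pte_pa(*pte) != pa ||
                        !pte_valid(*pte)) {
                            *pvp = pv->pv_next;     /* stale: unlink */
                            pv_free(pv);
                            continue;
                    }
                    /* ...check/clear the R and M bits in *pte here... */
                    pvp = &pv->pv_next;
            }
    }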

This still leaves the TLB inefficiencies in pmap_enter() and another
TLB inefficiency in pv_list walks (because they touch each PTE).
There probably isn't much to be done about these.

There is also the additional i386/ns32k issue.  A couple of partial
solutions come to mind:

* Keep all PTPs mapped in KVA space, and keep a shadow copy of the
  PDE with VAs rather than PAs.  This way the PTE for any pmap can be
  accessed without doing a TLB flush.  (A sketch of this follows the
  list.)

* Have pmap_map_ptes() only guarantee that one PTP is mapped; this way
  only one TLB entry has to be flushed.  (This wouldn't help on the
  original i386, not that that's a big consideration at this point.)

  There are several variations on this, from using an LRU of PTPs to
  keeping the `alternate page table' region as is but only using one
  page at a time from it.
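
Here is a sketch of the first idea, with made-up names; the constants
are for the i386's two-level tables:

    #define PDSHIFT 22                      /* i386: each PDE maps 4MB */
    #define PTSHIFT 12                      /* each PTE maps 4KB */
    #define NPTEPG  1024                    /* PTEs per page table page */

    typedef unsigned long vaddr_t;
    typedef unsigned long pt_entry_t;

    struct pmap {
            pt_entry_t *pm_pdir;            /* real PDE: PAs, for the MMU */
            pt_entry_t **pm_ptps;           /* shadow PDE: KVA of each PTP */
    };

    /*
     * Find the PTE mapping va in pm through the kernel-mapped PTPs,
     * without touching the recursive PTE window and hence without a
     * TLB flush.
     */
    pt_entry_t *
    pmap_pte(struct pmap *pm, vaddr_t va)
    {
            pt_entry_t *ptp = pm->pm_ptps[va >> PDSHIFT];

            if (ptp == NULL)
                    return (NULL);          /* no PTP for this range */
            return (&ptp[(va >> PTSHIFT) & (NPTEPG - 1)]);
    }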

Those are my thoughts for today.  Shar and enjoy.