tech-kern: misc. MMU: NUMA, big pages, idle zero, ring buffers, PAE, ...

Subject: misc. MMU: NUMA, big pages, idle zero, ring buffers, PAE, ...
To: None <tech-kern@netbsd.org>
From: Edward B. DREGER <eddy+public+spam@noc.everquick.net>
List: tech-kern
Date: 04/29/2006 05:23:37
Greetings all,


[ Apologies for earlier cross-posting.  No idea why I sent to -perform
instead of -kern. ]

Note:  Some of these ramblings are ia32/aa64-focused, but the principles
are general.

While exploring PAE last November, I wound up browsing through uvm/pmap
code.  I've had a few additional ideas, and would like some [more]
feedback.


/* Big Pages */

Begin by allocating memory in 2M/4M strides (2M iff PAE, 4M iff !PAE).
Track wasted 4K [sub]pages.  Split big pages into smaller ones
when needed, but avoid using page tables until then.  Coalesce smaller
pages into bigger ones when free RAM permits.

Rationale:  Hopefully less MMU management overhead and fewer TLB misses
while memory is plentiful.  Fall back to standard behavior when needed.
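
A minimal sketch of the bookkeeping this implies, assuming the 2M (PAE)
case.  The struct and the bp_* helpers are hypothetical names made up
for illustration; they are not existing uvm/pmap interfaces, and a real
implementation would also have to update the PDE/PTEs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SMALL_PAGE 4096u
#define BIG_PAGE   (2u * 1024 * 1024)       /* 2M, i.e. the PAE case */
#define SUBPAGES   (BIG_PAGE / SMALL_PAGE)  /* 512 subpages per big page */

/* Hypothetical per-big-page record (illustration only). */
struct bigpage {
	bool     split;               /* true once a page table is in use */
	uint64_t used[SUBPAGES / 64]; /* one bit per 4K subpage in use */
};

static void bp_init(struct bigpage *bp)
{
	memset(bp, 0, sizeof(*bp));
}

/* Count the 4K subpages currently marked as in use. */
static unsigned bp_used(const struct bigpage *bp)
{
	unsigned n = 0;
	for (unsigned i = 0; i < SUBPAGES; i++)
		n += (unsigned)((bp->used[i / 64] >> (i % 64)) & 1u);
	return n;
}

/* Record that the first nbytes are really needed.
 * Returns the number of wasted (mapped but unneeded) 4K subpages. */
static unsigned bp_mark_used(struct bigpage *bp, size_t nbytes)
{
	unsigned need = (unsigned)((nbytes + SMALL_PAGE - 1) / SMALL_PAGE);
	assert(need <= SUBPAGES);
	for (unsigned i = 0; i < need; i++)
		bp->used[i / 64] |= (uint64_t)1 << (i % 64);
	return SUBPAGES - bp_used(bp);
}

/* Memory is tight: fall back to a page table.  Returns how many
 * subpages could be handed back to the 4K free list. */
static unsigned bp_split(struct bigpage *bp)
{
	bp->split = true;
	return SUBPAGES - bp_used(bp);
}

/* Mark one 4K subpage as free again. */
static void bp_release(struct bigpage *bp, unsigned idx)
{
	assert(idx < SUBPAGES);
	bp->used[idx / 64] &= ~((uint64_t)1 << (idx % 64));
}

/* Coalesce: a split page whose subpages are all free can be treated
 * as one big page again.  Returns true if it was coalesced. */
static bool bp_try_coalesce(struct bigpage *bp)
{
	if (bp->split && bp_used(bp) == 0) {
		bp->split = false;
		return true;
	}
	return false;
}
```

For example, a 10000-byte allocation needs 3 subpages, so 509 of the 512
are wasted until the page is split.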


/* Fractional/Checkpointed Zeroing of Big Pages */

I whipped up a crude program that performed 1000 bzero(3) iterations on
a 2M chunk.  Each iteration took about 9 ms on a PIII/500 notebook.
Should the idle-zero loop zero a fraction of a big page?  What about
dedicating a PDE slot (Intel terminology) to the zero code?

Rationale:  Several milliseconds -- although certainly less than 9 ms
on a faster CPU and with optimized zeroing code -- is an eternity.
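
One way to picture the fractional approach: keep a checkpoint per big
page and zero one slice per idle-loop pass.  This is only a sketch; the
64K slice size and the zero_state/zero_step names are assumptions, not
anything in the tree.  At the measured rate, 2M / 64K = 32 slices would
put each pass at roughly 9 ms / 32, or about 0.28 ms.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define BIG_PAGE   (2u * 1024 * 1024)
#define ZERO_CHUNK (64u * 1024)  /* assumed slice size; tune per CPU */

/* Hypothetical checkpoint record for one big page being zeroed. */
struct zero_state {
	unsigned char *page;  /* start of the big page */
	size_t         done;  /* bytes already zeroed (the checkpoint) */
};

/* Zero at most one slice, then advance the checkpoint.
 * Returns true once the whole big page has been zeroed. */
static bool zero_step(struct zero_state *zs)
{
	if (zs->done < BIG_PAGE) {
		size_t n = BIG_PAGE - zs->done;
		if (n > ZERO_CHUNK)
			n = ZERO_CHUNK;
		memset(zs->page + zs->done, 0, n);
		zs->done += n;
	}
	return zs->done == BIG_PAGE;
}
```

The idle loop would call zero_step() once per pass and check for
runnable work in between, so the worst-case latency is one slice rather
than one big page.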


/* Per-CPU Management */

Both of the above, as well as free page lists, should be per-CPU.  Can a
CPU be forced to work with the memory closest to it?  (Consider NUMA
performance, such as multiprocessor Opteron systems.)

Rationale:  Reduced inter-CPU contention.  Assuming processes have
significant CPU affinity, using "nearby" memory would reduce both
interconnect bandwidth use and memory access time.
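
A rough sketch of per-CPU free lists with a local-first allocation
policy.  NCPU, struct page, and the page_* helpers are placeholders
invented here.  A real kernel would need locking (or some other
protocol) for the remote case and would order the fallback by actual
NUMA distance rather than by CPU number.

```c
#include <assert.h>
#include <stddef.h>

#define NCPU 4  /* assumed CPU count, for illustration */

/* Hypothetical page descriptor: remembers which CPU's memory it is. */
struct page {
	struct page *next;
	int          home_cpu;
};

struct cpu_freelist {
	struct page *head;
	unsigned     count;
};

static struct cpu_freelist freelists[NCPU];

/* Return a page to the free list of the CPU it is closest to. */
static void page_free(struct page *pg)
{
	struct cpu_freelist *fl = &freelists[pg->home_cpu];
	pg->next = fl->head;
	fl->head = pg;
	fl->count++;
}

/* Allocate for 'cpu': try the local list first, then the others.
 * Returns NULL when every list is empty. */
static struct page *page_alloc(int cpu)
{
	for (int i = 0; i < NCPU; i++) {
		struct cpu_freelist *fl = &freelists[(cpu + i) % NCPU];
		if (fl->head != NULL) {
			struct page *pg = fl->head;
			fl->head = pg->next;
			fl->count--;
			pg->next = NULL;
			return pg;
		}
	}
	return NULL;
}
```

Only the fallback path touches another CPU's list, which is where the
contention savings would come from.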


/* Ring Buffers */

A native mapping for ring buffers would be nice:

  	u_char *ringbuf = mmapringbuf(..., MAP_RINGBUF, ...) ;

would allocate a memory region from <base> to <base + 2 * size>.  i.e.,

  	base
  	base + size

would both be aliased to the same physical pages.  Voila!  Simple,
linear ringbuf where the MMU handles wraparound at the region's end.

Rationale:  It's just so much easier this way. :-)
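
The aliasing can be approximated from userland today with plain mmap(2),
which shows what a native mapping would have to set up.  This is a
sketch under assumptions: ringbuf_map() is a made-up helper (not the
mmapringbuf() proposed above), the backing store is an unlinked
temporary file in /tmp, and size must be a multiple of the page size.

```c
#define _DEFAULT_SOURCE  /* mkstemp, ftruncate, MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map 'size' bytes of backing store twice, back to back.
 * Returns the base address, or NULL on failure. */
static unsigned char *ringbuf_map(size_t size)
{
	char path[] = "/tmp/ringbufXXXXXX";
	int fd = mkstemp(path);
	if (fd < 0)
		return NULL;
	unlink(path);  /* keep the file anonymous */
	if (ftruncate(fd, (off_t)size) != 0) {
		close(fd);
		return NULL;
	}

	/* Reserve 2 * size of contiguous address space. */
	unsigned char *base = mmap(NULL, 2 * size, PROT_NONE,
	    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED) {
		close(fd);
		return NULL;
	}

	/* Map the same file pages at base and at base + size. */
	void *lo = mmap(base, size, PROT_READ | PROT_WRITE,
	    MAP_SHARED | MAP_FIXED, fd, 0);
	void *hi = mmap(base + size, size, PROT_READ | PROT_WRITE,
	    MAP_SHARED | MAP_FIXED, fd, 0);
	close(fd);  /* the mappings keep the pages alive */
	if (lo == MAP_FAILED || hi == MAP_FAILED) {
		munmap(base, 2 * size);
		return NULL;
	}
	return base;
}
```

A write that starts near the end of the first half simply continues
into the second half and shows up at the start of the buffer, so a
memcpy() across the wrap point needs no special casing.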


/* mremap() */

Zero-copy allocation-size changes are convenient.

Rationale:  Obvious.
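
For reference, this is roughly what the operation looks like with
Linux's mremap(2), used here purely as an illustration of the idea.
The call is Linux-specific, other systems may use a different
signature, and grow_region() is a made-up wrapper.

```c
#define _GNU_SOURCE  /* mremap() and MREMAP_MAYMOVE on Linux/glibc */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Resize a mapping without copying its contents.  The kernel may move
 * it to a new address (MREMAP_MAYMOVE).  Returns NULL on failure, in
 * which case the old mapping is left untouched. */
static void *grow_region(void *old, size_t old_size, size_t new_size)
{
	void *p = mremap(old, old_size, new_size, MREMAP_MAYMOVE);
	return p == MAP_FAILED ? NULL : p;
}
```

The existing pages are remapped rather than copied, and for an
anonymous mapping the newly added tail reads as zeros.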


Eddy
--
Everquick Internet - http://www.everquick.net/
A division of Brotsman & Dreger, Inc. - http://www.brotsman.com/
Bandwidth, consulting, e-commerce, hosting, and network building
Phone: +1 785 865 5885 Lawrence and [inter]national
Phone: +1 316 794 8922 Wichita
________________________________________________________________________
DO NOT send mail to the following addresses:
davidc@brics.com -*- jfconmaapaq@intc.net -*- sam@everquick.net
Sending mail to spambait addresses is a great way to get blocked.
Ditto for broken OOO autoresponders and foolish AV software backscatter.