A new page allocator for UVM



Hi,

Anyone interested in taking a look?  This solves the problem we have with
uvm_fpageqlock.  Here's the code, and the blurb is below:

	http://www.netbsd.org/~ad/2019/allocator.diff

Cheers,
Andrew


General overview

	The page allocator currently works in 4 dimensions, with freelist
	being the most important consideration, page colour the next, then
	the zeroed/unknown queue, and finally the choice between the global
	list and the local CPU list.  It has a global counter tracking the
	number of free pages.

	uvm -> freelist[f] -> colour[c] -> queue[q] +-> global list
	                                            \-> local CPU list

	uvm -> uvmexp.free

	It turns out that this allocator has a "lock contention problem",
	which is actually more of a false sharing problem, because
	allocating a page involves touching and writing back to many cache
	lines.
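
	To make the false sharing concrete, here is roughly how those four
	dimensions stack up in C.  The names below are invented for this
	mail and are not the structures in the tree; the point is that an
	allocation walks and dirties several of these objects plus the
	global counter, all under uvm_fpageqlock:

	#include <sys/queue.h>

	struct vm_page;				/* opaque for this sketch */
	TAILQ_HEAD(pglist, vm_page);

	/*
	 * One queue per (freelist, colour): zeroed vs. unknown pages, each
	 * split into a global list and a local CPU list, all guarded by the
	 * single uvm_fpageqlock along with the global uvmexp.free counter.
	 */
	struct old_queue {
		struct pglist	q_global;	/* shared, global list */
		struct pglist	q_cpu;		/* local CPU list */
	};

	struct old_colour {
		struct old_queue	c_queue[2];	/* unknown / zeroed */
	};

	struct old_freelist {
		struct old_colour	*fl_colour;	/* [ncolours] */
	};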

	The new allocator is reduced to working in 3 dimensions.  The
	separate lists for zeroed/unknown pages have been deleted (but the
	ability to track whether an individual page is zeroed is retained),
	the CPU free lists have been deleted, and a new concept of "bucket"
	introduced.  Again freelist is the most important consideration,
	bucket is the next, and colour the last (* unless the experimental
	NUMA stuff is enabled, but that's disabled for now).

	uvm -> freelist[f] -> bucket[b] -> colour[c]
	                         \
	                          -> pgb_nfree

	The data structures to manage this are carefully arranged so that
	there's minimal false sharing, fewer write backs need to occur, and
	empty freelists (of which there are often many) can be skipped
	quickly without having to scan everything or take locks.
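
	Sketched in C (again with invented names, and padding details
	elided), the bucket becomes the unit that is counted and locked, and
	each bucket sits in its own cache line(s) so that CPUs working in
	different buckets don't write back to the same lines.  The locks
	themselves live in a separate padded array (uvm_freelist_locks,
	which shows up in the lock contention output below):

	#include <sys/queue.h>
	#include <stdint.h>

	#define	SKETCH_COHERENCY_UNIT	64	/* assumed cache line size */
	#define	SKETCH_MAX_BUCKETS	8	/* cf. PGFL_MAX_BUCKETS */

	struct vm_page;				/* opaque for this sketch */
	TAILQ_HEAD(pglist, vm_page);

	/* One bucket: its own free page count plus one list per colour. */
	struct sketch_bucket {
		uintptr_t	b_nfree;	/* cf. pgb_nfree */
		struct pglist	b_colour[1];	/* really [ncolours] */
	} __attribute__((__aligned__(SKETCH_COHERENCY_UNIT)));

	/* One freelist: just an array of its buckets. */
	struct sketch_freelist {
		struct sketch_bucket	*fl_bucket[SKETCH_MAX_BUCKETS];
	};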

Buckets

	In both the old and new allocators we have the ability to
	dynamically rearrange the free lists to support an arbitrary number
	of page colours.  Buckets work this way too, but there is a maximum
	number of them (PGFL_MAX_BUCKETS, currently 8).

	The number of buckets is sized at runtime according to the number of
	physical CPU packages in the system (ci->ci_package_id), and each
	logical CPU is assigned a preferred bucket to allocate from
	(ucpu->pgflbucket).
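
	A minimal sketch of that boot-time assignment follows.  The function
	and the uvm_cpu lookup are hypothetical, and the real code need not
	fold the package ID into the bucket range this way; it's only meant
	to show the mapping from package to preferred bucket:

	#include <sys/cpu.h>
	#include <uvm/uvm.h>

	/* Hypothetical accessor: the uvm_cpu for a given CPU. */
	struct uvm_cpu *sketch_uvm_cpu(struct cpu_info *);

	static void
	sketch_assign_buckets(void)
	{
		CPU_INFO_ITERATOR cii;
		struct cpu_info *ci;

		for (CPU_INFO_FOREACH(cii, ci)) {
			/* One bucket per physical package, capped at the
			 * maximum number of buckets. */
			sketch_uvm_cpu(ci)->pgflbucket =
			    ci->ci_package_id % PGFL_MAX_BUCKETS;
		}
	}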

	This does a few things for us (when the system is not very low on
	memory):

	- It dramatically reduces lock contention by reducing the number of
	  CPUs competing for access to each bucket.

	- The lock and data structures for each bucket typically remain in
	  cache somewhere on the chip, and thereby don't need to go over the
	  bus, further reducing contention.

	- It has a positive effect on L2/L3 cache locality, because if we
	  free pages back to the bucket corresponding to the CPU where they
	  were last used, the pages held in the free list are very likely to
	  be in cache on the chip they will next be allocated from.  The
	  current allocator has a problem where logical CPUs "hoard" L2/L3
	  cache by having their own free lists.

Experimental rudimentary NUMA support

	I thought it would be interesting to try this, just for fun.

	This changes the configuration of the buckets to correspond to NUMA
	nodes, and assigns pages to buckets according to which node the
	hardware tells us they live on.  It also changes the strategy for
	freeing pages, so that pages ALWAYS go back to their assigned bucket
	instead of moving about as they do with the L2/L3 strategy, and the
	strategy for allocating, so that allocating from the correct bucket
	is considered more important than allocating from the correct free
	list.
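
	To make the difference in search order concrete, here is an
	illustrative sketch (not the code from the diff), where
	sketch_try_alloc() stands in for locking one bucket and trying to
	take a page of the wanted colour from one (freelist, bucket) pair:

	#include <stdbool.h>
	#include <stddef.h>

	struct vm_page;

	/* Hypothetical helper: a page from one (freelist, bucket) pair,
	 * or NULL if that pair has nothing suitable. */
	struct vm_page *sketch_try_alloc(unsigned fl, unsigned b,
	    unsigned colour);

	struct vm_page *
	sketch_alloc(unsigned nfl, unsigned nbkt, unsigned mybkt,
	    unsigned colour, bool numa)
	{
		struct vm_page *pg;
		unsigned fl, i, b;

		if (!numa) {
			/* L2/L3 strategy: the right freelist matters most;
			 * within it, start with our own bucket and wrap. */
			for (fl = 0; fl < nfl; fl++) {
				for (i = 0; i < nbkt; i++) {
					b = (mybkt + i) % nbkt;
					pg = sketch_try_alloc(fl, b, colour);
					if (pg != NULL)
						return pg;
				}
			}
		} else {
			/* NUMA strategy: the right bucket (node) matters
			 * most; exhaust every freelist in it before even
			 * looking at another bucket. */
			for (i = 0; i < nbkt; i++) {
				b = (mybkt + i) % nbkt;
				for (fl = 0; fl < nfl; fl++) {
					pg = sketch_try_alloc(fl, b, colour);
					if (pg != NULL)
						return pg;
				}
			}
		}
		return NULL;
	}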

	I have disabled this for now, even when NUMA is enabled on the
	system, because it's not quite there yet and needs more testing.  It
	does seem to work, though (for a compile job).

Per-CPU caching

	There is a tiny per-CPU caching layer that can be paused and resumed
	at runtime, and it's only enabled when CPUs are sharing buckets. 
	This functions as a buffer, where pages are shuffled to and from
	buckets in batch.  On a typical Intel system, about 224kB worth of
	pages will be cached.  That's not much, but is enough to further
	reduce lock contention by an order of magnitude without exerting too
	much pressure on the L2/L3 cache by hoarding.
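
	The shape of it is roughly the following; the names and batch sizes
	are invented for illustration, and the real layer also handles being
	paused and resumed as described above:

	#include <stddef.h>

	struct vm_page;

	#define	SKETCH_CACHE_PAGES	56	/* ~224kB of 4kB pages */

	/* Hypothetical per-CPU buffer of free pages. */
	struct sketch_pgcache {
		struct vm_page	*pc_page[SKETCH_CACHE_PAGES];
		unsigned	 pc_count;
	};

	/* Hypothetical batch operations on this CPU's preferred bucket;
	 * each takes the bucket lock once for the whole batch. */
	unsigned sketch_bucket_fill(struct vm_page **, unsigned);
	void	 sketch_bucket_drain(struct vm_page **, unsigned);

	/* Allocate from the cache; refill in one batch when it runs dry. */
	struct vm_page *
	sketch_cache_alloc(struct sketch_pgcache *pc)
	{

		if (pc->pc_count == 0) {
			pc->pc_count = sketch_bucket_fill(pc->pc_page,
			    SKETCH_CACHE_PAGES / 2);
			if (pc->pc_count == 0)
				return NULL;	/* fall back to the buckets */
		}
		return pc->pc_page[--pc->pc_count];
	}

	/* Free to the cache; drain half in one batch when it fills up. */
	void
	sketch_cache_free(struct sketch_pgcache *pc, struct vm_page *pg)
	{

		if (pc->pc_count == SKETCH_CACHE_PAGES) {
			pc->pc_count -= SKETCH_CACHE_PAGES / 2;
			sketch_bucket_drain(&pc->pc_page[pc->pc_count],
			    SKETCH_CACHE_PAGES / 2);
		}
		pc->pc_page[pc->pc_count++] = pg;
	}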

Idlezero

	I have removed the guts of this, but the hooks to support it are
	still in place.  I think it may be useful to experiment with zeroing
	the per-CPU cached pages, or try some other method, but what we have
	now doesn't work well on modern systems.

How this looks in practice

	Here is what the freelists look like on my system, and the dmesg
	from said system.

	http://www.netbsd.org/~ad/2019/freelist.txt
	http://www.netbsd.org/~ad/2019/dmesg.boot

Results

	This is from a "make -j96" kernel build, with a !DIAGNOSTIC, GENERIC
	kernel on the system mentioned above.  System time is the most
	interesting figure here.  With NUMA disabled in the BIOS:

	       74.55 real      1635.13 user       725.04 sys	before
	       72.66 real      1653.86 user       593.19 sys	after

	With NUMA enabled in the BIOS and in the allocator:

	       76.81 real      1690.27 user       797.56 sys	before
	       71.10 real      1632.42 user       603.41 sys	after

	Lock contention before (no NUMA):

	Total%  Count   Time/ms          Lock          
	------ ------- --------- ----------------------
	 99.80 36756212 182656.88 uvm_fpageqlock       

	Lock contention after (no NUMA):

	Total%  Count   Time/ms          Lock
	------ ------- --------- ----------------------
	 20.21  196928    132.50 uvm_freelist_locks+40  <all>
	 18.72  180522    122.74 uvm_freelist_locks     <all>

