A new page allocator for UVM
Hi,
Anyone interested in taking a look? This solves the problem we have with
uvm_fpageqlock. Here's the code, and the blurb is below:
http://www.netbsd.org/~ad/2019/allocator.diff
Cheers,
Andrew
General overview
The page allocator currently works in 4 dimensions: freelist is
the most important consideration, page colour the next, then the
zeroed/unknown queue, and finally whether a page sits on the
global list or a per-CPU list. It has a single global counter
tracking the number of free pages.
    uvm -> freelist[f] -> colour[c] -> queue[q] +-> global list
                                                \-> local CPU list

    uvm -> uvmexp.free
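In structure terms that looks roughly like this (a simplified
sketch; the names and array bounds are illustrative, not the
exact kernel definitions):

    /* Old shape, simplified; names and bounds illustrative. */
    struct pgfl_old {
            /* One list per colour and per zeroed/unknown queue. */
            struct pglist   queues[MAXCOLOURS][NQUEUES];
    };
    struct pgfl_old page_free[VM_NFREELIST];         /* global */
    struct pgfl_old cpu_free[MAXCPUS][VM_NFREELIST]; /* per-CPU */
    u_int           uvmexp_free;    /* global free page counter */
    kmutex_t        uvm_fpageqlock; /* one lock covers all of it */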
It turns out that this allocator has a "lock contention problem",
which is actually more of a false sharing problem, because
allocating a page involves touching and writing back to many cache
lines.
The new allocator is reduced to working in 3 dimensions. The
separate lists for zeroed/unknown pages have been deleted (but the
ability to track whether an individual page is zeroed is retained),
the CPU free lists have been deleted, and a new concept of "bucket"
has been introduced. Again freelist is the most important
consideration, bucket the next, and colour the last (unless the
experimental NUMA support is enabled, but that is disabled for
now).
    uvm -> freelist[f] -> bucket[b] -> colour[c]
                                   \
                                    -> pgb_nfree
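Again, roughly this shape (a sketch with approximate names;
judging by the lockstat output below, the per-bucket locks sit in
an array padded out to cache line size):

    /* New shape, simplified; names approximate. */
    struct pgflbucket {
            u_int           pgb_nfree;  /* free pages in bucket */
            struct pglist   pgb_colour[MAXCOLOURS];
    };
    struct pgfreelist {
            struct pgflbucket *pgfl_buckets[PGFL_MAX_BUCKETS];
    };
    struct pgfreelist page_free[VM_NFREELIST];

    /* One lock per bucket, padded to a cache line so that the
     * locks themselves don't falsely share. */
    union uvm_freelist_lock {
            kmutex_t        lock;
            char            pad[COHERENCY_UNIT];
    } uvm_freelist_locks[PGFL_MAX_BUCKETS];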
The data structures to manage this are carefully arranged so that
there is minimal false sharing, fewer write-backs need to occur,
and empty freelists (of which there are often many) can be skipped
quickly, without having to scan everything or take locks.
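For example, the per-bucket free count allows the allocation path
to do something like this (a hand-waved sketch; the diff's real
logic differs in detail):

    /*
     * Skip empty freelists with a plain read of pgb_nfree.  The
     * unlocked read may race, but a stale answer is harmless: we
     * re-check under the lock before taking a page off the list.
     */
    struct vm_page *pg = NULL;
    for (u_int f = 0; f < VM_NFREELIST && pg == NULL; f++) {
            struct pgflbucket *pgb = page_free[f].pgfl_buckets[b];
            if (pgb->pgb_nfree == 0)
                    continue;               /* no lock taken */
            mutex_spin_enter(&uvm_freelist_locks[b].lock);
            if (pgb->pgb_nfree != 0) {
                    /* ... unlink a page of colour c into pg ... */
                    pgb->pgb_nfree--;
            }
            mutex_spin_exit(&uvm_freelist_locks[b].lock);
    }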
Buckets
In both old & new allocators we have the ability to dynamically
rearrange the free lists to support an arbitrary number of page
colours. Buckets work this way too but there is a maximum number
(PGFL_MAX_BUCKETS, currently 8).
The number of buckets is sized at runtime according to the number of
physical CPU packages in the system (ci->ci_package_id), and each
logical CPU is assigned a preferred bucket to allocate from
(ucpu->pgflbucket).
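The sizing and assignment happen at boot, along these lines
(sketch; the package-counting helper is hypothetical):

    /*
     * One bucket per physical package, capped at the compile-time
     * maximum; each logical CPU prefers the bucket of the package
     * it sits in.
     */
    CPU_INFO_ITERATOR cii;
    struct cpu_info *ci;
    u_int npackages = count_cpu_packages();     /* hypothetical */
    u_int nbuckets = MIN(npackages, PGFL_MAX_BUCKETS);

    for (CPU_INFO_FOREACH(cii, ci)) {
            struct uvm_cpu *ucpu = ci->ci_data.cpu_uvm;
            ucpu->pgflbucket = ci->ci_package_id % nbuckets;
    }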
This does a few things for us (when the system is not very low on
memory):
- It greatly reduces lock contention by reducing the number of
  CPUs competing for access to each bucket.
- The lock and data structures for each bucket typically remain in
cache somewhere on the chip, and thereby don't need to go over the
bus, further reducing contention.
- It has a positive effect on L2/L3 cache locality: if we free
  pages back to the bucket corresponding to the CPU where they
  were last used, the pages held on the free list are very likely
  to be in cache on the chip they will next be allocated from
  (see the sketch after this list). The current allocator has a
  problem where logical CPUs "hoard" L2/L3 cache by having their
  own free lists.
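The free path for that last strategy, sketched (illustrative, not
the diff's exact code):

    /*
     * Hand the page back to the bucket preferred by the CPU doing
     * the freeing, so it likely stays in that package's L2/L3 for
     * the next allocation.
     */
    void
    page_free_sketch(struct vm_page *pg, u_int f, u_int c)
    {
            u_int b = curcpu()->ci_data.cpu_uvm->pgflbucket;
            struct pgflbucket *pgb = page_free[f].pgfl_buckets[b];

            mutex_spin_enter(&uvm_freelist_locks[b].lock);
            /* ... insert pg on pgb->pgb_colour[c] ... */
            pgb->pgb_nfree++;
            mutex_spin_exit(&uvm_freelist_locks[b].lock);
    }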
Experimental rudimentary NUMA support
I thought it would be interesting to try this, just for fun.
This changes things in three ways:

- The buckets are configured to correspond to NUMA nodes, and
  pages are assigned to buckets according to which node the
  hardware tells us they live on.
- The strategy for freeing pages changes so that pages ALWAYS go
  back to their assigned bucket, instead of moving about as they
  do with the L2/L3 strategy.
- The strategy for allocating changes so that allocating from the
  correct bucket is considered more important than allocating
  from the correct free list (sketched after this list).
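In loop-nesting terms, the change in allocation priority looks
roughly like this (sketch):

    /* L2/L3 strategy: freelist outermost, buckets inner
     * (preferred bucket b tried first). */
    for (u_int f = 0; f < VM_NFREELIST; f++)
            for (u_int i = 0; i < nbuckets; i++)
                    /* try page_free[f], bucket (b + i) % nbuckets */;

    /* NUMA strategy: bucket (node) outermost, so a page from the
     * right node beats a page from the right freelist. */
    for (u_int i = 0; i < nbuckets; i++)
            for (u_int f = 0; f < VM_NFREELIST; f++)
                    /* try page_free[f], bucket (b + i) % nbuckets */;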
I have disabled this for now, even if NUMA is enabled on the
system, because it's not quite there yet and more testing is
needed. It does seem to work, though (for a compile job).
Per-cpu caching
There is a tiny per-CPU caching layer that can be paused and resumed
at runtime, and it's only enabled when CPUs are sharing buckets.
This functions as a buffer, where pages are shuffled to and from
buckets in batch. On a typical Intel system, about 224kB worth of
pages will be cached. That's not much, but is enough to further
reduce lock contention by an order of magnitude without exerting too
much pressure on the L2/L3 cache by hoarding.
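The shape of that layer is roughly this (sketch; the batch size
and the refill helper are invented for illustration):

    #define PGCACHE_BATCH   16              /* invented size */

    struct pgcache {
            u_int            pgc_count;
            struct vm_page  *pgc_pages[PGCACHE_BATCH];
    };

    /*
     * Allocate from the local buffer; when it runs dry, pull a
     * whole batch from the preferred bucket under one lock
     * acquisition instead of locking once per page.
     */
    struct vm_page *
    pgcache_alloc(struct pgcache *pgc)
    {
            if (pgc->pgc_count == 0 &&
                !pgcache_refill(pgc))       /* hypothetical helper */
                    return NULL;            /* fall back to buckets */
            return pgc->pgc_pages[--pgc->pgc_count];
    }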
Idlezero
I have removed the guts of this, but the hooks to support it are
still in place. I think it may be useful to experiment with zeroing
the per-CPU cached pages, or try some other method, but what we have
now doesn't work well on modern systems.
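For instance, zeroing pages parked in the per-CPU cache might
look something like this from the idle loop (purely a sketch of
the idea, not code from the diff):

    /* Idle-time zeroing of cached pages (idea sketch). */
    void
    pgcache_idlezero(struct pgcache *pgc)
    {
            for (u_int i = 0; i < pgc->pgc_count; i++) {
                    struct vm_page *pg = pgc->pgc_pages[i];
                    if ((pg->flags & PG_ZERO) == 0) {
                            pmap_zero_page(VM_PAGE_TO_PHYS(pg));
                            pg->flags |= PG_ZERO;
                    }
            }
    }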
How this looks in practice
Here is what the freelists look like on my system, and the dmesg
from said system.
http://www.netbsd.org/~ad/2019/freelist.txt
http://www.netbsd.org/~ad/2019/dmesg.boot
Results:
This is from a "make -j96" kernel build, with a !DIAGNOSTIC
GENERIC kernel on the system mentioned above. System time is the
most interesting number here. With NUMA disabled in the BIOS:
74.55 real 1635.13 user 725.04 sys before
72.66 real 1653.86 user 593.19 sys after
With NUMA enabled in the BIOS & the allocator:
76.81 real 1690.27 user 797.56 sys before
71.10 real 1632.42 user 603.41 sys after
Lock contention before (no NUMA):
Total% Count Time/ms Lock
------ ------- --------- ----------------------
99.80 36756212 182656.88 uvm_fpageqlock
Lock contention after (no NUMA):
Total% Count Time/ms Lock
------ ------- --------- ----------------------
20.21 196928 132.50 uvm_freelist_locks+40 <all>
18.72 180522 122.74 uvm_freelist_locks <all>