tech-kern archive


Re: uvm & percpu



On Tue, Jun 01, 2010 at 04:03:19PM +0300, Antti Kantee wrote:
> While reading the uvm page allocator code, I noticed it tries to allocate
> from percpu storage before falling back to global storage.  However, even
> if allocation from local storage was possible, a global stats counter is
> incremented (e.g. "uvmexp.cpuhit++").  In my measurements I've observed
> this type of "cheap" statcounting has a huge impact on percpu algorithms,
> as you still need to load&store a globally contended memory address.
> Furthermore, uvmexp cache lines are probably more contended than the page
> queue, so theoretically you get less than half of the possible benefit.
> 
> I don't expect anyone to remember what the benchmark used to justify
> the original percpu commit was, but if someone is going to work on it
> further, I'm curious as to how much gain the percpu allocator produced
> and how much more it would squeeze out if the global counter was left out.
> 
> The above example of course applies more generally.  When you're going
> all out with the bag of tricks, "i++" can be very expensive ...
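
To make the cost concrete, here is a rough userland sketch (nothing from
the actual uvm code, all names made up) of the difference between that
kind of "cheap" global increment and a padded per-CPU counter that is
only summed when somebody actually reads the statistic:

#include <stdint.h>

#define MAXCPUS		64
#define CACHE_LINE	64

/*
 * Per-CPU counters, padded so each one sits in its own cache line.
 * The "cheap" alternative, a single global uint64_t incremented from
 * every CPU, bounces one dirty line between all CPUs on every hit.
 */
static struct {
	uint64_t	count;
	char		pad[CACHE_LINE - sizeof(uint64_t)];
} cpu_hits[MAXCPUS] __attribute__((aligned(CACHE_LINE)));

static inline void
count_hit(unsigned curcpu)
{
	/* Touches only this CPU's own line; no cross-CPU traffic. */
	cpu_hits[curcpu].count++;
}

static uint64_t
sum_hits(void)
{
	uint64_t total = 0;

	/* Statistics readers pay the cost instead of the hot path. */
	for (unsigned i = 0; i < MAXCPUS; i++)
		total += cpu_hits[i].count;
	return total;
}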

I ran into the same issue with timecounters, and there the overhead was
huge: disabling the counter showed great performance improvements where
the timecounter hardware was largely parallel and light on memory access
(the CPU timestamp counter).

With the UVM allocator it's less of an overhead, since the allocator is
shielded by uvm_fpageqlock and often uvm_pageqlock (both globals),
and the data structures are not intentionally organised/optimized 
with MP cache behaviour in mind.  This is an area that definitely needs
improvement, and there's good potential there.  Consider NUMA,
or as a middle ground multiple sets of pagedaemon/allocator state that
correspond to cache units (cores, chips, whatever) or execution units
(threads, cores, whatever).
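
As a very rough sketch of the "multiple sets" idea (this is nothing like
the real uvm structures, the names and types are invented purely for
illustration): each cache unit gets its own freelist and lock, and the
contended global set is only touched when the local one runs dry.

#include <stddef.h>
#include <pthread.h>

struct page {
	struct page	*next;
};

struct page_set {
	pthread_mutex_t	lock;		/* contended only within one cache unit */
	struct page	*freelist;
	size_t		nfree;
};

#define NSETS	8			/* one per core/chip/whatever */
static struct page_set	local_sets[NSETS];
static struct page_set	global_set;

static void
page_sets_init(void)
{
	pthread_mutex_init(&global_set.lock, NULL);
	for (unsigned i = 0; i < NSETS; i++)
		pthread_mutex_init(&local_sets[i].lock, NULL);
}

static struct page *
set_get(struct page_set *ps)
{
	struct page *pg;

	pthread_mutex_lock(&ps->lock);
	if ((pg = ps->freelist) != NULL) {
		ps->freelist = pg->next;
		ps->nfree--;
	}
	pthread_mutex_unlock(&ps->lock);
	return pg;
}

static struct page *
page_get(unsigned unit)
{
	struct page *pg;

	/* Try the set belonging to this cache unit first ... */
	if ((pg = set_get(&local_sets[unit % NSETS])) != NULL)
		return pg;
	/* ... and only fall back to the contended global set if it's dry. */
	return set_get(&global_set);
}

Balancing pages between the sets would then be the pagedaemon's problem
rather than something the allocation fast path pays for on every call.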

The per-CPU bit worked out to be a win on the build.sh benchmark.
I implemented it to try to avoid cache writebacks to main memory
for short-lived processes, due to activity within anonymous and COWed
pages.  It could be a win for some other mad reason, but I assume that
the witnessed speed-up is for the reason outlined.


