uvm & percpu
While reading the uvm page allocator code, I noticed it tries to allocate
from percpu storage before falling back to global storage. However, even
if allocation from local storage was possible, a global stats counter is
incremented (e.g. "uvmexp.cpuhit++"). In my measurements, this kind of
"cheap" stat counting has a huge impact on percpu algorithms, as you
still need to load and store a globally contended memory address.
Furthermore, uvmexp cache lines are probably more contended than the page
queue, so theoretically you get less than half of the possible benefit.
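
To illustrate the alternative, here is a minimal userland sketch of a
counter that is itself kept per CPU and only summed when someone reads it.
It is not the uvm code: NCPU, CACHE_LINE, struct percpu_stat and both
functions are hypothetical names, the 64-byte line size is an assumption,
and the caller passes the CPU index by hand. A kernel version would take
the index from the current CPU and would have to keep the thread from
migrating during the increment.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU		4	/* hypothetical CPU count */
#define CACHE_LINE	64	/* assumed cache line size in bytes */

/*
 * One counter per CPU. The alignment pads each slot to a full cache
 * line, so two CPUs never write to the same line.
 */
struct percpu_stat {
	_Alignas(CACHE_LINE) uint64_t cpuhit;
};

static struct percpu_stat stats[NCPU];

/* Fast path: touches only the calling CPU's own cache line. */
static void
stat_cpuhit(unsigned cpu)
{
	assert(cpu < NCPU);
	stats[cpu].cpuhit++;
}

/*
 * Slow path: add up all slots on demand. The result is only a
 * snapshot, as other CPUs may still be counting while we read.
 */
static uint64_t
stat_cpuhit_total(void)
{
	uint64_t sum = 0;

	for (unsigned i = 0; i < NCPU; i++)
		sum += stats[i].cpuhit;
	return sum;
}
```

The increment stays a plain "i++", but on memory no other CPU writes to.
The cost moves to the reader, which is fine for statistics that are read
rarely compared to how often they are updated.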
I don't expect anyone to remember what the benchmark used to justify
the original percpu commit was, but if someone is going to work on it
further, I'm curious as to how much gain the percpu allocator produced
and how much more it would squeeze out if the global counter were left out.
The above example of course applies more generally. When you're going
all out with the bag of tricks, "i++" can be very expensive ...