On Sat, Sep 09, 2017 at 08:48:19PM +0200, Mateusz Guzik wrote:
>
> Here is a bunch of "./build.sh -j 40 kernel=MYCONF > /dev/null" on stock
> kernel:
> 618.65s user 1097.80s system 2502% cpu 1:08.60 total
[..]
>
> And on kernel with total hacks:
> 594.08s user 693.11s system 2459% cpu 52.331 total
[..]
>
> ======================================
>
> Here is a flamegraph from a fully patched kernel:
>
> https://people.freebsd.org/~mjg/netbsd/build-kernel-j40.svg
> And here are top mutex spinners:
> 59.42 1560022 184255.00 ffffe40138351180 <all>
> 57.52 1538978 178356.84 ffffe40138351180 uvm_fault_internal+7e0
> 1.23 8884 3819.43 ffffe40138351180 uvm_unmap_remove+101
> 0.67 12159 2078.61 ffffe40138351180 cache_lookup+97
>
> (see
> https://people.freebsd.org/~mjg/netbsd/build-kernel-j40-lockstat.txt )
>
So I added PoC batching to uvm_fault_lower_lookup and uvm_anon_dispose.
While real time barely moved and %sys is still hovering around 630,
I'm happy to report that wait time on the global locks dropped
significantly:
46.03 1162651 85410.88 ffffe40127167040 <all>
43.80 1146153 81273.38 ffffe40127167040 uvm_fault_internal+7c0
1.52 7112 2827.06 ffffe40127167040 uvm_unmap_remove+101
0.71 9385 1310.42 ffffe40127167040 cache_lookup+a5
0.00 1 0.01 ffffe40127167040 vfs_vnode_iterator_next1+87
https://people.freebsd.org/~mjg/netbsd/build-kernel-j40-hacks2.svg
https://people.freebsd.org/~mjg/netbsd/build-kernel-j40-hacks2-lockstat.txt

You can see on the flamegraph that the total time spent in the page
fault handler dropped and the non-user time shifted to syscall handling.
Specifically, genfs_lock is now a more significant player, accounting for
about 8.7% of total time (6.6% previously).
Batching can be enabled with:
sysctl -w use_anon_dispose_pagelocked=1
sysctl -w uvm_fault_batch_requeue=1
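
To illustrate what the batching amounts to, here is a sketch only: it
assumes the old global uvm_pageqlock, and batch_requeue() is an invented
name, not something from hacks2.diff. The idea is to collect pages first
and requeue them under a single acquisition of the page queue lock
instead of taking it once per page.

/*
 * Sketch of the batching pattern.  batch_requeue() is an invented name
 * and this assumes the global uvm_pageqlock; the real diff may well
 * structure it differently.
 */
#include <sys/param.h>
#include <sys/mutex.h>
#include <uvm/uvm.h>

static void
batch_requeue(struct vm_page **pgs, int npages)
{
	int i;

	mutex_enter(&uvm_pageqlock);	/* one acquisition for the whole batch */
	for (i = 0; i < npages; i++) {
		if (pgs[i] != NULL)
			uvm_pageactivate(pgs[i]);	/* requeue under the single hold */
	}
	mutex_exit(&uvm_pageqlock);
}

The point is simply to amortize the lock acquisition over the batch
instead of paying for it per page.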
Mix of total hackery is here:
https://people.freebsd.org/~mjg/netbsd/hacks2.diff

I'm quite certain there are other trivial wins in the handler.
I also noted that the mutex_spin_retry routine never lowers the spl while
spinning. I added total crap support for changing that, but did not
measure any difference.
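
For reference, the shape of that change would be roughly as follows.
This is pseudocode only: spin_lock_is_taken() and mutex_ipl_cookie() are
invented helpers, and the real mutex_spin_retry() in kern_mutex.c is not
structured like this; the only point is the spl dance.

/*
 * Pseudocode sketch: drop the raised spl while waiting for a contended
 * spin mutex, and raise it again before retrying the acquisition.
 */
static void
spin_wait_with_spl_drop(kmutex_t *mtx, int *sp)
{
	u_int count = SPINLOCK_BACKOFF_MIN;

	while (spin_lock_is_taken(mtx)) {
		splx(*sp);			/* let interrupts in while we wait */
		do {
			SPINLOCK_BACKOFF(count);
		} while (spin_lock_is_taken(mtx));
		*sp = splraiseipl(mutex_ipl_cookie(mtx));  /* re-raise before the acquire attempt */
	}
}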
There is also a currently-wrong hack for the namecache: instead of
taking the interlock, first check whether the usecount is 0 and, if not,
try to bump it by 1 with a cmpxchg. This races with possible transitions
to VS_BLOCKED.
I think the general idea will work fine if the prohibited state gets
embedded into the top bits of v_usecount. Regular bumps will be
unaffected, while a cmpxchg like the one here will automagically fail.
The only problem is code reading the count "by hand", which would have
to be updated to mask the bit.
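
A minimal sketch of that scheme, assuming a hypothetical VC_BLOCKED bit
(the flag name, its value and treating the count as a plain unsigned int
are assumptions for illustration, not the real vnode layout):

/*
 * Sketch only: once the "blocked" bit is set in the top bits of the
 * count, every optimistic cmpxchg bump fails on its own and the caller
 * falls back to the interlocked slow path.
 */
#include <sys/param.h>
#include <sys/atomic.h>

#define	VC_BLOCKED	0x40000000u	/* hypothetical "prohibited state" bit */

static bool
usecount_tryref(volatile unsigned int *countp)
{
	unsigned int old, new;

	do {
		old = *countp;
		if ((old & VC_BLOCKED) != 0 || (old & ~VC_BLOCKED) == 0)
			return false;	/* blocked or unreferenced: take the slow path */
		new = old + 1;
	} while (atomic_cas_uint(countp, old, new) != old);

	return true;
}

Only code that compares or prints the raw count would need to mask the
bit; the common bump/drop paths keep working unchanged.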
The reason for the hack is that the interlock is in fact the vm obj
lock, and taking it adds a tad bit of contention.