On Sat, Sep 09, 2017 at 08:48:19PM +0200, Mateusz Guzik wrote:
>
> Here is a bunch of "./build.sh -j 40 kernel=MYCONF > /dev/null" on stock
> kernel:
> 618.65s user 1097.80s system 2502% cpu 1:08.60 total
[..]
>
> And on kernel with total hacks:
> 594.08s user 693.11s system 2459% cpu 52.331 total
[..]
>
> ======================================
>
> Here is a flamegraph from a fully patched kernel:
>
> https://people.freebsd.org/~mjg/netbsd/build-kernel-j40.svg
> And here are top mutex spinners:
> 59.42 1560022 184255.00 ffffe40138351180 <all>
> 57.52 1538978 178356.84 ffffe40138351180 uvm_fault_internal+7e0
> 1.23 8884 3819.43 ffffe40138351180 uvm_unmap_remove+101
> 0.67 12159 2078.61 ffffe40138351180 cache_lookup+97
>
> (see
> https://people.freebsd.org/~mjg/netbsd/build-kernel-j40-lockstat.txt )
>
So I added PoC batching to uvm_fault_lower_lookup and uvm_anon_dispose.
While real time barely moved and %sys is still hovering around 630,
I'm happy to report that wait time on the global locks dropped
significantly:
46.03 1162651 85410.88 ffffe40127167040 <all>
43.80 1146153 81273.38 ffffe40127167040 uvm_fault_internal+7c0
1.52 7112 2827.06 ffffe40127167040 uvm_unmap_remove+101
0.71 9385 1310.42 ffffe40127167040 cache_lookup+a5
0.00 1 0.01 ffffe40127167040 vfs_vnode_iterator_next1+87
https://people.freebsd.org/~mjg/netbsd/build-kernel-j40-hacks2.svg
https://people.freebsd.org/~mjg/netbsd/build-kernel-j40-hacks2-lockstat.txt

You can see on the flamegraph that the total time spent in the page
fault handler dropped and the non-user time shifted to syscall handling.
Specifically, genfs_lock is now a more significant player, accounting for
about 8.7% of total time (6.6% previously).
Batching can be enabled with:
sysctl -w use_anon_dispose_pagelocked=1
sysctl -w uvm_fault_batch_requeue=1
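
To illustrate what the batching amounts to, here is a sketch only: it
assumes the old global uvm_pageqlock, and batch_requeue() is an invented
name, not something from hacks2.diff. The idea is to collect pages first
and requeue them under a single acquisition of the page queue lock
instead of taking it once per page.

/*
 * Sketch of the batching pattern.  batch_requeue() is an invented name
 * and this assumes the global uvm_pageqlock; the real diff may well
 * structure it differently.
 */
#include <sys/param.h>
#include <sys/mutex.h>
#include <uvm/uvm.h>

static void
batch_requeue(struct vm_page **pgs, int npages)
{
	int i;

	mutex_enter(&uvm_pageqlock);	/* one acquisition for the whole batch */
	for (i = 0; i < npages; i++) {
		if (pgs[i] != NULL)
			uvm_pageactivate(pgs[i]);	/* requeue under the single hold */
	}
	mutex_exit(&uvm_pageqlock);
}

The point is simply to amortize the lock acquisition over the batch
instead of paying for it per page.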
Mix of total hackery is here:
https://people.freebsd.org/~mjg/netbsd/hacks2.diff

I'm quite certain there are other trivial wins in the handler.
I also noted that the mutex_spin_retry routine never lowers the spl while
spinning. I added total crap support for changing that, but did not
measure any difference.
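
For reference, the shape of that change would be roughly as follows.
This is pseudocode only: spin_lock_is_taken() and mutex_ipl_cookie() are
invented helpers, and the real mutex_spin_retry() in kern_mutex.c is not
structured like this; the only point is the spl dance.

/*
 * Pseudocode sketch: drop the raised spl while waiting for a contended
 * spin mutex, and raise it again before retrying the acquisition.
 */
static void
spin_wait_with_spl_drop(kmutex_t *mtx, int *sp)
{
	u_int count = SPINLOCK_BACKOFF_MIN;

	while (spin_lock_is_taken(mtx)) {
		splx(*sp);			/* let interrupts in while we wait */
		do {
			SPINLOCK_BACKOFF(count);
		} while (spin_lock_is_taken(mtx));
		*sp = splraiseipl(mutex_ipl_cookie(mtx));  /* re-raise before the acquire attempt */
	}
}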
There is also a currently-wrong hack for the namecache: instead of
taking the interlock, first check whether the usecount is 0 and, if not,
try to bump it by 1 with a cmpxchg. This races with possible transitions
to VS_BLOCKED.
I think the general idea will work fine if the prohibited state gets
embedded into the top bits of v_usecount. Regular bumps will be
unaffected, while a cmpxchg like the one here will automagically fail.
The only problem is code reading the count "by hand", which would have
to be updated to mask the bit.
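
A minimal sketch of that scheme, assuming a hypothetical VC_BLOCKED bit
(the flag name, its value and treating the count as a plain unsigned int
are assumptions for illustration, not the real vnode layout):

/*
 * Sketch only: once the "blocked" bit is set in the top bits of the
 * count, every optimistic cmpxchg bump fails on its own and the caller
 * falls back to the interlocked slow path.
 */
#include <sys/param.h>
#include <sys/atomic.h>

#define	VC_BLOCKED	0x40000000u	/* hypothetical "prohibited state" bit */

static bool
usecount_tryref(volatile unsigned int *countp)
{
	unsigned int old, new;

	do {
		old = *countp;
		if ((old & VC_BLOCKED) != 0 || (old & ~VC_BLOCKED) == 0)
			return false;	/* blocked or unreferenced: take the slow path */
		new = old + 1;
	} while (atomic_cas_uint(countp, old, new) != old);

	return true;
}

Only code that compares or prints the raw count would need to mask the
bit; the common bump/drop paths keep working unchanged.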
The reason for the hack is that the interlock is in fact the vm obj
lock, and taking it adds a tad bit of contention.