tech-kern archive


performance issues during -j 40 kernel


I have been playing a little bit with a NetBSD vm running on Centos7 + kvm.
I ran into severe performance issues which I partially investigated.
A bunch of total hacks was written to confirm a few problems, but there is
nothing committable without doing real work, and major problems remain.

I think the kernel is in dire need of someone sitting down with the issues
reported below and seeing them through. I'm happy to test patches, although
I won't necessarily have access to the same hardware used for the current tests.

Hardware specs:
Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz
2 sockets * 10 cores * 2 hardware threads
32GB of ram

I assigned all 40 threads to the vm + gave it 16GB of ram.

The host is otherwise idle.

I installed the 7.1 release, downloaded a recent git snapshot and built the
trunk kernel using a config taken from the release (I had to edit out
something about 3G modems to make it compile). I presume this is enough
to avoid having debug options of any sort enabled.

The filesystem is just ufs mounted with noatime.

Attempts to use virtio for storage resulted in extremely abysmal
performance which I did not investigate. Using SATA gave read errors
and the vm failed to boot multiuser. I settled for IDE, which works
reasonably well but inherently makes the test worse.

All tests were performed with the trunk kernel booted.

Here is a bunch of runs of "./ -j 40 kernel=MYCONF > /dev/null" on the stock kernel:
  618.65s user 1097.80s system 2502% cpu 1:08.60 total
  628.73s user 1128.71s system 2540% cpu 1:09.18 total
  629.05s user 1082.58s system 2517% cpu 1:07.99 total
  641.11s user 1081.05s system 2545% cpu 1:07.65 total
  641.18s user 1079.89s system 2522% cpu 1:08.24 total

And on kernel with total hacks:
  594.08s user 693.11s system 2459% cpu 52.331 total
  594.81s user 711.90s system 2498% cpu 52.292 total
  600.34s user 676.39s system 2486% cpu 51.336 total
  597.33s user 725.78s system 2536% cpu 52.157 total
  597.13s user 708.79s system 2510% cpu 52.011 total

i.e. it's still pretty bad, with system time being above user. However,
real time dropped from ~68 to ~52 seconds and %sys from ~1100 to ~700.

Hacks can be seen here (wear gloves and something to protect eyes):

1. #define UBC_NWINS 1024

The parameter was set in 2001 and is used on amd64 to this very day.

lockstat says:
 51.63  585505 321201.06 ffffe4011d8304c0       <all>
 40.39  291550 251302.17 ffffe4011d8304c0       ubc_alloc+69
  9.13  255967  56776.26 ffffe4011d8304c0       ubc_release+a5
  1.72   35632  10680.06 ffffe4011d8304c0       uvm_fault_internal+532

The contention is on the global ubc vmobj lock just prior to the hash lookup.
I recompiled the kernel with an arbitrarily chosen value of 65536 and the
problem cleared itself, with ubc_alloc contention going way down.

I made no attempts to check what value makes sense or how to autoscale it.

This change alone accounts for most of the speedup, giving:
  586.87s user 919.99s system 2612% cpu 57.676 total
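
For what it's worth, autoscaling could be sketched along these lines. This
is purely illustrative: the one-window-per-MB ratio, the 65536 cap and the
helper name ubc_nwins_for are my guesses, not anything derived from the code:

```c
/* Hypothetical autoscaling for UBC_NWINS: one window per MB of RAM,
 * clamped to [1024, 65536] and rounded down to a power of two for cheap
 * hash masking. The ratio and the cap are guesses, not derived values. */
static unsigned int
ubc_nwins_for(unsigned long physmem_mb)
{
	unsigned long n = physmem_mb;

	if (n < 1024)
		n = 1024;		/* the historical floor */
	if (n > 65536)
		n = 65536;		/* the value that helped above */
	while (n & (n - 1))
		n &= n - 1;		/* round down to a power of two */
	return (unsigned int)n;
}
```

On the 16GB guest used here this would pick 16384 windows.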

2. uvm_pageidlezero

Idle zeroing definitely makes no sense on amd64 these days. Any number of
pages prepared in advance is quickly shredded, and the vast majority of
allocations end up zeroing in place anyway. With rep stosb this is even
less of a problem.

Here it turned out to be harmful by inducing avoidable cacheline traffic.

Look at nm kernel | sort -nk 1:
ffffffff810b8fc0 B uvm_swap_data_lock
ffffffff810b8fc8 B uvm_kentry_lock
ffffffff810b8fd0 B uvm_fpageqlock
ffffffff810b8fd8 B uvm_pageqlock
ffffffff810b8fe0 B uvm_kernel_object

All these locks false-share a cacheline. In particular uvm_fpageqlock is
obstructing uvm_pageqlock.

An attempt to run zeroing performs mutex_tryenter, which unconditionally
does lock cmpxchg. That dirties the cacheline, so even if zeroing ends up
not being performed the damage is already done. Chances are successful
zeroing is also a problem, but I did not investigate that.

#if 0'ing the uvm_pageidlezero call in the idle func shaved about 2
seconds of real time:
  589.02s user 792.62s system 2541% cpu 54.365 total

This should definitely be disabled for amd64 altogether and probably
removed in general.
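
If the tryenter itself is to stay, the dirtying can be mitigated with a
classic test-and-test-and-set: peek at the lock word with a plain load and
only issue the cmpxchg when the lock looks free. A userspace sketch with
C11 atomics (polite_tryenter is a made-up name, not the NetBSD mutex API):

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint lockword;

/* Test-and-test-and-set: a plain load keeps the cacheline in shared
 * state; the dirtying cmpxchg is only issued when the lock looks free. */
static bool
polite_tryenter(atomic_uint *lw)
{
	if (atomic_load_explicit(lw, memory_order_relaxed) != 0)
		return false;	/* read-only peek: no cacheline dirtying */
	unsigned int expected = 0;
	return atomic_compare_exchange_strong(lw, &expected, 1);
}
```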

3. false sharing

Following the issue noted earlier, I applied __cacheline_aligned to the
aforementioned locks, and also moved atomically updated counters out of uvmexp.

uvmexp is full of counters updated with mere increments, possibly by
multiple threads, so the issue with this object is not fully resolved.

Nonetheless, said annotations combined with the rest give the
improvement mentioned earlier.


Here is a flamegraph from a fully patched kernel:

And here are top mutex spinners:
 59.42 1560022 184255.00 ffffe40138351180       <all>
 57.52 1538978 178356.84 ffffe40138351180       uvm_fault_internal+7e0
  1.23    8884   3819.43 ffffe40138351180       uvm_unmap_remove+101
  0.67   12159   2078.61 ffffe40138351180       cache_lookup+97

(see )

Note that netbsd`0xffffffff802249ba is x86_pause. Since the function
does not push a frame pointer it is shown next to the actual caller, as
opposed to above it. Sometimes called functions get misplaced anyway;
I don't know why.

1. exclusive vnode locking (genfs_lock)

It is used even for path lookup, which, as can be seen, leads to avoidable
contention. From what I'm told the primary reason is ufs constructing
some state up front in case it has to create an inode at the end of the lookup.

However, since most lookups are not intended to create anything, this
behavior can be made conditional. I don't know the details, but ufs on
FreeBSD most certainly uses shared locking for common case lookups.

2. uvm_fault_internal

It's shown as the main waiter for a vm obj lock. The flamegraph hints
that the real problem is with uvm_pageqlock & friends taken elsewhere:
most page fault handlers serialize on the vm obj lock while its holder
waits for uvm_pageqlock.

3. pmap

It seems most issues stem from slow pmap handling. Chances are there are
perfectly avoidable shootdowns, and in fact cases where there is no need
to alter KVA in the first place.

4. vm locks in general

Most likely there are trivial cases where operations can be batched,
especially on process exit where there are multiple pages to operate on.
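
A toy illustration of the batching idea: gather the pages and take the
queue lock once per batch rather than once per page (the mutex and counter
merely stand in for uvm_pageqlock and the real queue manipulation):

```c
#include <pthread.h>
#include <stddef.h>

#define BATCH	16

static pthread_mutex_t pageqlock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long pages_freed;

static void
flush_batch(void **pages, size_t n)
{
	(void)pages;			/* the real code would dequeue each page */
	pthread_mutex_lock(&pageqlock);	/* one acquisition for up to BATCH pages */
	pages_freed += n;
	pthread_mutex_unlock(&pageqlock);
}

static void
free_pages(void **pages, size_t n)
{
	size_t i = 0;

	while (i < n) {
		size_t take = (n - i > BATCH) ? BATCH : n - i;
		flush_batch(pages + i, take);
		i += take;
	}
}
```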


I would like to add a remark about locking primitives.

Today the rage is with MCS locks, which are fine but not trivial to
integrate with sleepable locks like your mutexes. Even so, the current
implementation is significantly slower than it has to be.

First, the lock word is read twice on entry to mutex_vector_enter - once
to determine the lock type and then to read the owner.

Spinning mutexes should probably be handled by a different routine.

lock cmpxchg already returns the found value (the owner). It can be
passed by the assembly routine to the slow path. This allows making
an initial pass at backoff without accessing the lock in the meantime.
In the face of contention the cacheline could have changed ownership by
the time you get to the read, so using the value we already saw avoids
spurious bus transactions. Given a low initial spin count this should
not have negative effects.
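
A userspace sketch of both points: the cmpxchg hands back the observed
owner for free, and the spin count doubles up to a tunable ceiling (all
names are illustrative; this is not the NetBSD mutex code):

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uintptr_t mtx_owner;

/* mtx_try returns the owner the cmpxchg observed, so the slow path can
 * do its first backoff pass without re-reading the contended lock word. */
static uintptr_t
mtx_try(uintptr_t self)
{
	uintptr_t expected = 0;

	if (atomic_compare_exchange_strong(&mtx_owner, &expected, self))
		return 0;		/* lock acquired */
	return expected;		/* the observed owner, for free */
}

/* Exponential backoff with a tunable ceiling; the busy loop stands in
 * for the pause instruction. */
static void
backoff_spin(unsigned int *count, unsigned int maxspin)
{
	volatile unsigned int dummy = 0;

	for (unsigned int i = 0; i < *count; i++)
		dummy++;		/* cpu_pause() in a real kernel */
	if (*count < maxspin)
		*count <<= 1;
}
```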

The backoff parameters were hardcoded a decade ago and are really off even
by the standards of today's modest servers. For kicks I changed the max
spin count to 1024, and in a trivial microbenchmark doing dup2 + close
in 40 threads I got almost double the throughput.

Interestingly, this change caused a regression for the kernel build.
I did not investigate; I suspect the cause was that the eventual vm obj
lock holder was now less aggressive in trying to grab the lock, and that
caused problems for everyone else waiting on it.

The spin loop itself is weird in that, instead of just embedding the
pause instruction, it calls a function. This is probably less power- and
sibling-thread-friendly than it needs to be.


Mateusz Guzik <mjguzik>
