Re: performance issues during build.sh -j 40 kernel

On Sun, Sep 10, 2017 at 06:51:31PM +0100, Mindaugas Rasiukevicius wrote:
> Mateusz Guzik <mjguzik%gmail.com@localhost> wrote:
> > 1. exclusive vnode locking (genfs_lock)
> >
> > ...
> >
> > 2. uvm_fault_internal
> >
> > ...
> >
> > 4. vm locks in general
> >
>
> We know these points of lock contention, but they are not really that
> trivial to fix. Breaking down the UVM pagequeue locks would generally
> be a major project, as it would be the first step towards NUMA support.
> In any case, patches are welcome. :)
>

Breaking up locks is of course the preferred long-term solution, but it is
also time consuming. On the other hand, there are most likely reasonably
easy fixes, such as collapsing several lock/unlock cycles into just one.
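
A minimal sketch of that kind of collapsing, in userspace C with pthreads
rather than kernel code. struct stats, its fields and the helper names are
made up for illustration and do not correspond to anything in the tree:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical state guarded by a single lock. */
struct stats {
	pthread_mutex_t lock;
	unsigned long nlocks;	/* lock acquisitions, counted for illustration */
	long pages;
	long faults;
};

static void
stats_lock(struct stats *s)
{
	pthread_mutex_lock(&s->lock);
	s->nlocks++;		/* updated while holding the lock */
}

static void
stats_unlock(struct stats *s)
{
	pthread_mutex_unlock(&s->lock);
}

/* Before: two lock/unlock cycles for one logical update. */
static void
update_split(struct stats *s)
{
	assert(s != NULL);
	stats_lock(s);
	s->pages++;
	stats_unlock(s);

	stats_lock(s);
	s->faults++;
	stats_unlock(s);
}

/* After: the same work done under a single acquisition. */
static void
update_collapsed(struct stats *s)
{
	assert(s != NULL);
	stats_lock(s);
	s->pages++;
	s->faults++;
	stats_unlock(s);
}
```

Both variants leave the counters in the same state, but the second one
touches the lock word half as often, which is what matters once the lock
is contended.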

FreeBSD is no saint here either, with one global lock for free pages, yet
it manages to work OK-ish with 80 hardware threads and is quite nice
with 40.

That said, I had enough problems $elsewhere to not be interested in
looking too hard here. :>

> > 3. pmap
> >
> > It seems most issues stem from slow pmap handling. Chances are there are
> > perfectly avoidable shootdowns and in fact cases where there is no need
> > to alter KVA in the first place.
>
> At least x86 pmap already performs batching and has quite efficient
> synchronisation logic. You are right that there are some key places
> where avoiding KVA map/unmap would have a major performance improvement,
> e.g. UBC and mbuf zero-copy mechanisms (it could operate on physical
> pages for I/O). However, these changes are not really related to pmap.
> Some subsystems just need an alternative to temporary KVA mappings.
>

I was predominantly looking at the teardown of UBC mappings. The flamegraph
suggests the cost there is excessive.

> >
> > I would like to add a remark about locking primitives.
> >
> > Today the rage is with MCS locks, which are fine but not trivial to
> > integrate with sleepable locks like your mutexes. Even so, the current
> > implementation is significantly slower than it has to be.
> >
> > ...
> >
> > Spinning mutexes should probably be handled by a different routine.
> >
> > ...
> >
>
> I disagree, because this is a wrong approach to the problem. Instead of
> marginally optimising the slow-path (and the more contended is the lock,
> the less impact these micro-optimisations have), the subsystems should be
> refactored to eliminate the lock contention in the first place. Yes, it
> is much more work, but it is the long term fix. Having said that, I can
> see some use cases where MCS locks could be useful, but it is really a low
> priority in the big picture.
>

Locks are fundamentally about damage control. As noted earlier, spurious
bus transactions due to an avoidable read make performance a tad worse
than necessary. That was minor anyway; the more important bit was the
backoff.

Even on systems that are modest by today's standards, the quality of locking
primitives can be the difference between a system which is slower than
ideal but perfectly usable and one which is just dog slow.

That said, making the backoff parameters autoscale with the number of CPUs,
with some kind of upper cap, is definitely warranted.
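
To illustrate, here is a rough sketch of a spin lock with exponential
backoff whose limit scales with the CPU count and is clamped to a cap. It
uses C11 atomics in userspace and is not the existing mutex code. The
constants BACKOFF_MIN and BACKOFF_CAP, the linear scaling rule and all
function names are assumptions picked for the example:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define BACKOFF_MIN	4u	/* hypothetical starting delay */
#define BACKOFF_CAP	1024u	/* hypothetical upper cap */

struct spinlock {
	atomic_bool locked;
};

/* Scale the maximum delay with the CPU count, clamped to BACKOFF_CAP. */
static unsigned
backoff_limit(unsigned ncpu)
{
	if (ncpu == 0)
		ncpu = 1;
	if (ncpu > BACKOFF_CAP / BACKOFF_MIN)
		return BACKOFF_CAP;	/* also avoids overflow below */
	return BACKOFF_MIN * ncpu;
}

/* Busy-wait for roughly n iterations without touching the lock. */
static void
spin_delay(unsigned n)
{
	for (volatile unsigned i = 0; i < n; i++)
		continue;
}

static void
spin_lock(struct spinlock *l, unsigned ncpu)
{
	const unsigned limit = backoff_limit(ncpu);
	unsigned delay = BACKOFF_MIN;

	assert(l != NULL);
	for (;;) {
		/* One atomic attempt; success means we own the lock. */
		if (!atomic_exchange_explicit(&l->locked, true,
		    memory_order_acquire))
			return;

		/* Contended: back off, then wait using plain loads only. */
		do {
			spin_delay(delay);
			delay = (delay * 2 > limit) ? limit : delay * 2;
		} while (atomic_load_explicit(&l->locked,
		    memory_order_relaxed));
	}
}

static void
spin_unlock(struct spinlock *l)
{
	atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

With these made-up constants the limit would be 160 iterations on 40 CPUs
and 320 on 80, and it stops growing at 1024 no matter how big the machine
is. Real values would have to come from measurements.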

--
Mateusz Guzik
Swearing Maintenance Engineer