tech-kern archive

Re: performance issues during build.sh -j 40 kernel



Mateusz Guzik <mjguzik%gmail.com@localhost> wrote:
> ...
> 
> 1) #define UBC_NWINS 1024
> 
> The parameter was set in 2001 and is used on amd64 to this very day.
> 
> lockstat says:
>  51.63  585505 321201.06 ffffe4011d8304c0       <all>
>  40.39  291550 251302.17 ffffe4011d8304c0       ubc_alloc+69
>   9.13  255967  56776.26 ffffe4011d8304c0       ubc_release+a5
>   1.72   35632  10680.06 ffffe4011d8304c0       uvm_fault_internal+532
> [snip]
> 
> The contention is on the global ubc vmobj lock just prior to hash lookup.
> I recompiled the kernel with a randomly slapped value of 65536 and the
> problem cleared itself, with ubc_alloc going way down.
> 
> I made no attempts to check what value makes sense or how to autoscale it.
> ...

Yes, ubc_nwins should be auto-tuned, I'd say depending on the physical
memory size and the number of CPUs (as some weighted multiplier).
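
Something along these lines, perhaps, early in ubc_init() (a rough sketch
only: the per-4MB ratio, the ncpu factor and the clamps are made-up
placeholders that would need benchmarking, and it assumes the static
default is replaced by 0 so the value can still be overridden):

        uint64_t nwins;

        if (ubc_nwins == 0) {
                /* one window per 4MB of RAM: placeholder ratio, not measured */
                nwins = (uint64_t)physmem * PAGE_SIZE / (4 * 1024 * 1024);
                /* weight by the number of CPUs: placeholder factor */
                nwins *= MAX(ncpu, 1);
                /* clamp between the historical default and an upper bound */
                ubc_nwins = MIN(MAX(nwins, UBC_NWINS), 65536);
        }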

> 2. uvm_pageidlezero
> 
> Idle zeroing these days definitely makes no sense on amd64. Any amount of
> pages possibly prepared is quickly shredded and the vast majority of all
> allocations end up zeroing in place. With rep stosb this is even less of
> a problem.

My feeling is the same: on heavily loaded systems the pressures are too
high and for idling systems it's not worth the hassle.  However, I guess
others might have a different feeling.  More benchmarks and analysis could
settle this.

> 3. false sharing
> 
> Following the issue noted earlier, I __cacheline_aligned the aforementioned
> locks, and also moved atomically updated counters out of uvmexp.
> 
> uvmexp is full of counters updated with mere increments, possibly by
> multiple threads, so the false sharing on this object was not resolved.
> 
> Nonetheless, said annotations, combined with the rest, give the
> improvement mentioned earlier.

Yes, although if they get significantly contended, they should be moved
out to struct uvm_cpu and/or percpu(9) API and aggregated on collection.
It depends on the counter, of course.
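
For example, with the percpu(9) API a hot counter could look roughly like
this (the uvm_faults_pc name and the wrapper functions are hypothetical,
just to illustrate the update-locally/aggregate-on-read pattern):

        #include <sys/percpu.h>

        static percpu_t *uvm_faults_pc;         /* hypothetical counter */

        void
        uvm_fault_counter_init(void)
        {
                uvm_faults_pc = percpu_alloc(sizeof(uint64_t));
        }

        /* cheap, uncontended update on the hot path */
        void
        uvm_fault_count(void)
        {
                uint64_t *c = percpu_getref(uvm_faults_pc);

                (*c)++;
                percpu_putref(uvm_faults_pc);
        }

        static void
        uvm_fault_sum_cb(void *p, void *arg, struct cpu_info *ci)
        {
                *(uint64_t *)arg += *(uint64_t *)p;
        }

        /* aggregate on collection, e.g. when sysctl asks for it */
        uint64_t
        uvm_fault_total(void)
        {
                uint64_t total = 0;

                percpu_foreach(uvm_faults_pc, uvm_fault_sum_cb, &total);
                return total;
        }

Updates then only touch the local CPU's cache line, which removes both the
contention and the false sharing; the trade-off is that reading the total
becomes more expensive and slightly stale.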

> 1. exclusive vnode locking (genfs_lock)
> 
> ...
> 
> 2. uvm_fault_internal
> 
> ...
> 
> 4. vm locks in general
> 

We know these points of lock contention, but they are not really that
trivial to fix.  Breaking down the UVM pagequeue locks would generally
be a major project, as it would be the first step towards NUMA support.
In any case, patches are welcome. :)

> 3. pmap
> 
> It seems most issues stem from slow pmap handling. Chances are there are
> perfectly avoidable shootdowns and in fact cases where there is no need
> to alter KVA in the first place.

At least x86 pmap already performs batching and has quite efficient
synchronisation logic.  You are right that there are some key places
where avoiding KVA map/unmap would yield a major performance improvement,
e.g. the UBC and mbuf zero-copy mechanisms (they could operate on physical
pages for I/O).  However, these changes are not really related to pmap.
Some subsystems just need an alternative to temporary KVA mappings.
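
As a rough illustration of "operating on physical pages", on a port with a
direct map the UBC copy could bypass temporary windows entirely; the
pmap_direct_va() helper below is hypothetical and PMAP_DIRECT_MAP() simply
stands for whatever the direct-map translation ends up being called:

        /* hypothetical: return the direct-map alias of a managed page */
        static inline void *
        pmap_direct_va(struct vm_page *pg)
        {
                return (void *)PMAP_DIRECT_MAP(VM_PAGE_TO_PHYS(pg));
        }

        /*
         * Copy to/from a page with no pmap_kenter_pa()/pmap_kremove(),
         * hence no KVA churn and no TLB shootdowns.
         */
        int
        ubc_uiomove_direct_sketch(struct vm_page *pg, size_t off,
            size_t len, struct uio *uio)
        {
                void *va = (char *)pmap_direct_va(pg) + off;

                return uiomove(va, len, uio);
        }

The catch, of course, is that not every port has a direct map, so a
fallback to the current window scheme would still be needed.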

> 
> I would like to add a remark about locking primitives.
> 
> Today the rage is with MCS locks, which are fine but not trivial to
> integrate with sleepable locks like your mutexes. Even so, the current
> implementation is significantly slower than it has to be.
> 
> ...
> 
> Spinning mutexes should probably be handled by a different routine.
> 
> ...
> 

I disagree, because this is the wrong approach to the problem.  Instead of
marginally optimising the slow path (and the more contended the lock is,
the less impact these micro-optimisations have), the subsystems should be
refactored to eliminate the lock contention in the first place.  Yes, it
is much more work, but it is the long-term fix.  Having said that, I can
see some use cases where MCS locks could be useful, but it is really a low
priority in the big picture.
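
For anyone unfamiliar with them, the core of an MCS lock is a queue node
that the caller must keep alive for the whole critical section; the
generic C11 sketch below (not NetBSD code) shows the idea, and that
lifetime requirement is what makes mixing them with sleepable kmutex_t
semantics awkward:

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stddef.h>

        struct mcs_node {
                _Atomic(struct mcs_node *) next;
                atomic_bool locked;
        };

        struct mcs_lock {
                _Atomic(struct mcs_node *) tail;
        };

        static void
        mcs_acquire(struct mcs_lock *l, struct mcs_node *me)
        {
                struct mcs_node *prev;

                atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
                atomic_store_explicit(&me->locked, true, memory_order_relaxed);

                /* enqueue ourselves at the tail */
                prev = atomic_exchange_explicit(&l->tail, me,
                    memory_order_acq_rel);
                if (prev == NULL)
                        return;                 /* lock was free */

                /* spin on our own node only: no cache line ping-pong */
                atomic_store_explicit(&prev->next, me, memory_order_release);
                while (atomic_load_explicit(&me->locked, memory_order_acquire))
                        ;
        }

        static void
        mcs_release(struct mcs_lock *l, struct mcs_node *me)
        {
                struct mcs_node *expected = me;
                struct mcs_node *next =
                    atomic_load_explicit(&me->next, memory_order_acquire);

                if (next == NULL) {
                        /* no visible successor: try to reset the tail */
                        if (atomic_compare_exchange_strong_explicit(&l->tail,
                            &expected, NULL, memory_order_acq_rel,
                            memory_order_acquire))
                                return;
                        /* a successor is enqueueing; wait for the link */
                        while ((next = atomic_load_explicit(&me->next,
                            memory_order_acquire)) == NULL)
                                ;
                }
                atomic_store_explicit(&next->locked, false,
                    memory_order_release);
        }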

-- 
Mindaugas

