Re: kern/53124 (FFS is slow because pmap_update doesn't scale)
The following reply was made to PR kern/53124; it has been noted by GNATS.
From: mlelstv%serpens.de@localhost (Michael van Elst)
Subject: Re: kern/53124 (FFS is slow because pmap_update doesn't scale)
Date: Sun, 25 Mar 2018 10:35:52 -0000 (UTC)
max%m00nbsd.net@localhost (Maxime Villard) writes:
> o Does the 'UMAP_MAPPING_CACHED' branch get taken the same way with and
> without your N-1 user threads. If there is a clear difference here, then
> it means the problem is that UBC does not scale. Otherwise:
The branch is always taken the same way. It's the same number of TLB
shootdowns. pmap_update just completes faster when the other CPUs
are running in userland.
A TLB shootdown for pmap_kernel is sent to:
kcpuset_running is filled by a CPU reaching the idle loop and
apparently is not cleared (putting a CPU offline should probably
clear it, but that only happens during ACPI sleep).
So that should be independent of which pmap is used by the other CPUs.
But maybe it depends on whether a CPU is idle or not. For a test
I disabled the acpicpu module; this changes machdep.idle-mechanism
from acpi to halt (why not mwait?).
As a result, reading from cache sped up from ~145MB/s to 600MB/s
(16 core with HT) and from ~300MB/s to ~1GB/s (16 core without HT).
In the first case, running up to 15 infinite loops didn't change
anything. With 24 infinite loops, we are at ~1GB/s. With 31 loops,
we are at 1.5GB/s.
So for one, the acpi idle-mechanism has a larger wakeup latency
for handling the IPI than halt, and the latency for a running process
is low enough so that you don't see the bad scaling.
One optimization would therefore be to skip idle CPUs when flushing
the TLB and to catch up when leaving the idle loop. This is not
trivial, as "leaving" also includes interrupt handlers.
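The idea could be modelled roughly as below. This is only a single-threaded
user-space sketch, not NetBSD code: struct cpu, flush_gen, shootdown(),
idle_enter() and idle_exit() are hypothetical names invented for
illustration, and the races a real kernel has to handle (a CPU going idle
or waking up while the shootdown is in progress, memory ordering between
the idle flag and the generation counter) are ignored.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPU 4

/* Hypothetical per-CPU state; this is NOT NetBSD's struct cpu_info. */
struct cpu {
	bool idle;              /* CPU is parked in the idle loop */
	unsigned long seen_gen; /* last flush generation applied locally */
	int ipis;               /* IPIs received (for illustration only) */
	int flushes;            /* local TLB flushes performed */
};

struct cpu cpus[NCPU];
unsigned long flush_gen;        /* bumped on every kernel-pmap shootdown */

/* Stand-in for the real local TLB invalidation. */
void
local_flush(struct cpu *c)
{
	c->flushes++;
	c->seen_gen = flush_gen;
}

/*
 * Shootdown initiated by CPU `self`: interrupt only the CPUs that are
 * not idle. Returns the number of IPIs that had to be sent.
 */
int
shootdown(int self)
{
	int sent = 0;

	assert(self >= 0 && self < NCPU);
	flush_gen++;
	local_flush(&cpus[self]);
	for (int i = 0; i < NCPU; i++) {
		if (i == self || cpus[i].idle)
			continue;       /* idle CPUs catch up in idle_exit() */
		cpus[i].ipis++;
		local_flush(&cpus[i]);  /* models the remote IPI handler */
		sent++;
	}
	return sent;
}

void
idle_enter(int i)
{
	cpus[i].idle = true;
}

/*
 * Would have to run on every path out of the idle loop, including
 * interrupt entry, before any kernel mapping is touched.
 */
void
idle_exit(int i)
{
	cpus[i].idle = false;
	if (cpus[i].seen_gen != flush_gen)
		local_flush(&cpus[i]);  /* deferred flush */
}
```

With CPUs 2 and 3 idle, shootdown(0) in this model interrupts only CPU 1,
and a later idle_exit(2) performs the flush that CPU 2 skipped. The hard
part mentioned above is exactly the idle_exit() hook: it has to cover
interrupt handlers too.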
Michael van Elst
"A potential Snark may lurk in every tree."