Re: benchmark results on ryzen 3950x with netbsd-9, -current, and -current (no DIAGNOSTIC)
On Tue, Mar 03, 2020 at 08:25:25PM +1100, matthew green wrote:
> here are a few build benchmark tests on an amd ryzen 3950x
> system, to see the cumulative effect of the various fixes we've
> seen since netbsd-9, for this 16 core/ 32 thread CPU, 64GB of
> ram, separate nvme ssd for src & obj.
Cool! Thank you very much for doing this. Really interesting to see these
results.
> below has a full summary, but the highlights:
> - building kernels into tmpfs is 10-12% faster
> - DIAGNOSTIC costs 3-8%
> - current's better CPU thread aware scheduler helps -j16 on
> 32 CPUs significantly (more than double benefit compared
> to the other tests.)
> - "build.sh release" is about 10% faster
> - kernel builds are similarly about 10% faster
> - builds of ircII are 22% faster, though configure only 11%
> - -j32 is faster than -j16 or -j24
> - -j40 is not much worse than -j32, occasionally faster
> and the one lowlight:
> - "du -mcs *" on a src tree already in ram has lost about 30%
> performance, though still under 1 second.
OK that's intriguing and something that can hopefully be diagnosed with
a bit of dtrace. I'm not aware of a reason for it.
> time for amd64 GENERIC kernel builds:
>                  -j16     -j24     -j32     -j40
> netbsd-9         2m26.56  1m55.30  1m43.46  1m43.82
> current (DIAG)   2m01.25  1m46.84  1m40.22  1m41.12
> current          1m54.56  1m39.57  1m33.09  1m34.06
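As a rough cross-check of the summary percentages against this table, here is a minimal Python sketch. It reads each entry as minutes plus decimal seconds and uses only the -j16 and -j32 columns. The names TIMES, to_seconds and pct_less are made up for illustration and are not part of any NetBSD tooling.

```python
import re

# Figures copied from the kernel build table; names here are illustrative only.
TIMES = {
    "netbsd-9":       {"-j16": "2m26.56", "-j32": "1m43.46"},
    "current (DIAG)": {"-j16": "2m01.25", "-j32": "1m40.22"},
    "current":        {"-j16": "1m54.56", "-j32": "1m33.09"},
}


def to_seconds(stamp):
    """Convert an 'NmSS.ss' wall-clock string into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+\.\d+)", stamp)
    if match is None:
        raise ValueError(f"unrecognised time: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def pct_less(old, new):
    """Percentage by which `new` is smaller than `old`."""
    return 100.0 * (old - new) / old


if __name__ == "__main__":
    for jobs in ("-j16", "-j32"):
        n9 = to_seconds(TIMES["netbsd-9"][jobs])
        diag = to_seconds(TIMES["current (DIAG)"][jobs])
        cur = to_seconds(TIMES["current"][jobs])
        # Overhead of DIAGNOSTIC is measured against the non-DIAGNOSTIC build.
        overhead = 100.0 * (diag - cur) / cur
        print(f"{jobs}: current needs {pct_less(n9, cur):.1f}% less time "
              f"than netbsd-9, DIAGNOSTIC adds {overhead:.1f}%")
```

With these inputs the drop in wall-clock time from netbsd-9 to -current is close to 22% at -j16 and close to 10% at -j32, which fits the "more than double benefit" remark. DIAGNOSTIC adds roughly 6% and 8% respectively.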
Another perspective and a couple of observations: this is from my dual
socket system with 24 cores / 48 threads total, running -current and
building GENERIC on a SATA disk. This is with -j48.
132.53s real  1501.10s user  3291.25s system  nov 2019
 86.29s real  1537.95s user   786.29s system  mar 2020
 79.16s real  1602.97s user   419.54s system  mar 2020 !DIAGNOSTIC
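To make the shift in these three runs easier to see, here is a small Python sketch that derives two figures from them: the share of CPU time spent in the kernel, and the average number of busy CPUs (CPU time divided by wall-clock time). The names RUNS, system_share and busy_cpus are my own, chosen for illustration.

```python
# (real, user, system) in seconds, copied from the three -j48 runs.
RUNS = {
    "nov 2019": (132.53, 1501.10, 3291.25),
    "mar 2020": (86.29, 1537.95, 786.29),
    "mar 2020 !DIAGNOSTIC": (79.16, 1602.97, 419.54),
}


def system_share(user, system):
    """Percentage of total CPU time that was system (kernel) time."""
    return 100.0 * system / (user + system)


def busy_cpus(real, user, system):
    """Average number of CPUs kept busy over the wall-clock duration."""
    return (user + system) / real


if __name__ == "__main__":
    for label, (real, user, system) in RUNS.items():
        print(f"{label:22s} system share {system_share(user, system):5.1f}%  "
              f"avg busy CPUs {busy_cpus(real, user, system):5.1f}")
```

By this measure system time falls from about 69% of all CPU time in nov 2019 to about 34% in mar 2020, and to about 21% without DIAGNOSTIC. Put another way, the DIAGNOSTIC kernel burns nearly 1.9 times the system time of the non-DIAGNOSTIC one for the same build.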
I agree with Greg, the picture with DIAGNOSTIC isn't good. I think one of
the culprits may be in the x86 pmap where it regularly scans all CPUs
checking if a pmap is in use - that should probably be DEBUG. Beyond that,
I don't have good ideas (other than that further investigation is warranted).
In the case of the dual socket system, the difference is pronounced and my
take on it is that contention stresses the interconnect, and backpressure is
then exerted on every CPU in the system, not just those CPUs actively
contending with others. With a single socket system that kind of fight
stays on chip in the cache.