Subject: Re: Thread benchmarks, round 2
To: Andrew Doran <ad@NetBSD.org>
From: Kris Kennaway <kris@FreeBSD.org>
Date: 10/05/2007 11:18:24
Andrew Doran wrote:
> So, I learned a few things since I put up the previous set of benchmarks:
> - The erratic behaviour from Linux is due to the glibc memory allocator.
> Using Google's tcmalloc, the problem disappears.
Well, you have to be careful there: tcmalloc apparently defers frees and
is not really a general-purpose malloc. The Linux performance problems
are (were? I haven't tried recent kernels) real though.
> - I missed a few things when porting jemalloc from FreeBSD. One of them
> was fairly major. Due to my mistake jemalloc on NetBSD was, basically,
> single threaded. That said it did show a noticeable improvement over
> - There was a nasty performance bug in NetBSD's pthread mutexes, which
> is now fixed. libpthread has also had a couple more tweaks for performance
> that have had a positive impact.
> - The memory allocator used has a significant effect on sysbench itself:
> it needs to be multithreaded.
> - Mindaugas has made more improvements to his scheduler and these are
> showing a really positive effect.
> So after making some changes to NetBSD, and changes to how I'm benchmarking
> the systems, I have rerun them. In contrast to the previous runs, this one
> is done locally:
I am somewhat surprised by this, because on FreeBSD it is really not
spending much time in the kernel (only ~20% system time), so there does
not seem to be much scope for a 10% performance difference. Also it
took quite a lot of work to optimize locking of various kernel
subsystems that are used by this workload, and until that point there
was significant kernel lock contention which reduced performance by tens
of percent. I would have expected this to matter on NetBSD - even with
the vmlocking work there is still more to go.
I will try to reproduce this on my own hardware (see below).
> Kris Kennaway has kindly offered to try NetBSD on an 8-way system. I expect
> that NetBSD will hit a fairly clear ceiling due to poll, fcntl and socket
> I/O causing contention on kernel_lock. It will be interesting to see.
Here is the initial run with CVS HEAD sources (I took out the obvious
things from GENERIC.MP like I386_CPU support, etc, and removed the
default datasize and stack size limits). Same benchmark config that
Andrew is using, etc.
There are a couple of things to note:
* the drop-off above 8 threads on FreeBSD is due to non-scalability of
mysql itself. i.e. it comes from pthread mutex contention in userland.
This is the only relevant lock contention point in the FreeBSD kernel
on this workload. There are some things we can do in libpthread to
mitigate the performance loss in the over-contended pthread situation,
but we haven't done them yet.
* The tail end of the graph is somewhat noisy, which is the reason for
the jump at 19 threads (I only graphed a single run). The distribution
at 20 clients looks like:
| x x |
|x x x xxx x x xx x x xxx x xx|
| |_______________A_M_____________| |
N Min Max Median Avg Stddev
x 20 2326.01 2758.86 2586.47 2572.856 116.69937
Next, to try to reproduce Andrew's result, I disabled 4 CPUs (using
cpuctl in NetBSD) and compared FreeBSD and NetBSD again. I didn't do a
full graph yet, but the results are consistent with what I saw on 8 CPUs.
Note that these are lower but not too different from the NetBSD values
when all 8 CPUs are in use.
The 4 thread performance is basically identical to the 8 CPU case,
showing that the FreeBSD scaling graphed on 8 CPUs is the same as on 4
CPUs (but without the tail since mysql contention is now rate-limited),
i.e. FreeBSD is continuing to scale linearly.
This measurement shows that FreeBSD is performing 70-80% better than
NetBSD in this 4 CPU configuration. This is in contrast to Andrew's
findings which seem to show NetBSD performing 10% better than FreeBSD on
a 4 CPU system (a very old one though).
I will try later with the experimental kernel Andrew sent me (which
includes the new scheduler). If it indeed gives a 100% performance
improvement that would be a significant result :-)