Subject: Re: Thread benchmarks
To: Warner Losh <>
From: Andrew Doran <>
List: tech-kern
Date: 10/01/2007 14:33:35
On Fri, Sep 28, 2007 at 02:04:00PM -0600, Warner Losh wrote:

> > Back in March I posted some MySQL benchmarks after we switched to a 1:1
> > threading model in -current *. I've spent a lot of time tuning the pthread
> > library so I thought I'd post a followup. The original benchmark that I used
> > (supersmack) now performs much better on -current than it did a few months
> > ago, so I picked something else this time: MySQL sysbench.
> > 
> > Most of the sysbench runs that I've seen to date have sysbench running on
> > the same machine as the database. That's a good test but with the exception
> > of small installations and out-of-band activity, production setups rarely
> > look like that. So I ran sysbench itself on a separate dual-core system.
> > 
> > Here are the results, comparing NetBSD 3 with NetBSD-current:
> > 
> >
> > 
> > And NetBSD-current compared to other systems:
> > 
> >
> > 
> > Note this is stock NetBSD-current with FreeBSD's malloc() (jemalloc) in
> > libc. I'll be merging that some time soon.
> Which kernel config did you use for the FreeBSD results?

I took the generic config, removed the debugging options (INVARIANTS,
WITNESS and whatever else I could find) and added SCHED_ULE.
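
For anyone wanting to reproduce the FreeBSD side, a kernel config along those lines can be written as an include of GENERIC plus `nooptions` lines. This is a sketch, not the exact file used: the config name NODEBUG and the helper write_nodebug_conf are hypothetical, the option list assumes the debugging knobs GENERIC carried at the time, and it assumes GENERIC still selected SCHED_4BSD (config(8) accepts only one scheduler).

```sh
#!/bin/sh
# Hypothetical helper: write a FreeBSD kernel config that starts from
# GENERIC, drops the common kernel debugging options and selects ULE.
# Assumes GENERIC-era option names; check your GENERIC for others.
write_nodebug_conf() {
    conf=$1    # e.g. /usr/src/sys/i386/conf/NODEBUG
    cat > "$conf" <<'EOF'
include GENERIC
ident   NODEBUG

nooptions INVARIANTS
nooptions INVARIANT_SUPPORT
nooptions WITNESS
nooptions WITNESS_SKIPSPIN

# Assumes GENERIC selects SCHED_4BSD; only one scheduler may be configured.
nooptions SCHED_4BSD
options   SCHED_ULE
EOF
}

# Example (as root):
#   write_nodebug_conf /usr/src/sys/i386/conf/NODEBUG
#   cd /usr/src && make buildkernel installkernel KERNCONF=NODEBUG
```
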

> In tests that have been run on p4 hardware, the FreeBSD system's graph
> looks more like NetBSD's than the one presented here.  FreeBSD's kernel
> has a lot of performance-hurting debugging options on by default.  Also,
> FreeBSD's malloc defaults to 'AJ' in head, which would result in reduced
> performance.
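
On FreeBSD of this era, malloc(3) reads option letters from the name of the /etc/malloc.conf symlink (or from the MALLOC_OPTIONS environment variable), and a lowercase letter turns off the matching uppercase option. A minimal sketch of switching off the 'A' and 'J' flags mentioned above; the function name disable_malloc_debug and its optional path argument are hypothetical conveniences, not part of any tool.

```sh
#!/bin/sh
# Hypothetical helper: point the malloc(3) options symlink at "aj",
# i.e. lowercase a (warnings are not fatal) and j (no junk filling).
# Assumes FreeBSD's jemalloc-era malloc.conf convention; writing the
# real /etc/malloc.conf needs root.
disable_malloc_debug() {
    link=${1:-/etc/malloc.conf}
    # -f replaces an existing link. The link target is never opened;
    # only its name is read as an option string.
    ln -sf aj "$link"
}

# Per-process alternative that leaves /etc alone:
#   env MALLOC_OPTIONS=aj mysqld_safe &
```

mysqld has to be restarted after changing the symlink, since the options are read when the allocator initialises.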

I can try turning off debugging in the allocator. What else would you like
me to try? I would like to provide remote access to the two systems but
unfortunately my Internet link is unreliable and I'm not in a position to
leave them on 24x7. Some details on the test: I grabbed my.cnf from Jeff
Roberson's weblog:

Relevant bits of dmesg from the MySQL host:

total memory = 2047 MB
avail memory = 2008 MB
cpu0: Intel Pentium III Xeon (686-class), 701.64 MHz, id 0x6a1
cpu0: features 383fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features 383fbff<PGE,MCA,CMOV,PAT,PSE36,MMX>
cpu0: features 383fbff<FXSR,SSE>
cpu0: I-cache 16 KB 32B/line 4-way, D-cache 16 KB 32B/line 4-way
cpu0: L2 cache 1 MB 32B/line 8-way
cpu0: ITLB 32 4 KB entries 4-way, 2 4 MB entries fully associative
cpu0: DTLB 64 4 KB entries 4-way, 8 4 MB entries 4-way
fxp0 at pci1 dev 6 function 0: i82559 Ethernet, rev 8
fxp0: interrupting at ioapic0 pin 3 (irq 3)
fxp0: Ethernet address 00:02:a5:45:a6:48
inphy0 at fxp0 phy 1: i82555 10/100 media interface, rev. 4
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
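
A quick way to pull lines like these out of the boot messages is a grep over dmesg output. This sketch hard-codes the device names from this host (cpu0, fxp0, inphy0), and the function name relevant_dmesg is only illustrative.

```sh
#!/bin/sh
# Keep only the memory, CPU and NIC lines from boot messages on stdin.
# The device names are the ones on this particular host; adjust the
# pattern for other hardware.
relevant_dmesg() {
    grep -E '^(total memory|avail memory|cpu0|fxp0|inphy0)[ :=]'
}

# Example: dmesg | relevant_dmesg
```
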

The disk subsystem doesn't matter since I was running the read-only test,
and with 10000 rows everything fits in core. I compiled MySQL by hand on
each system:

./configure --prefix=/local/mysql --with-pthread --with-innodb
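
The read-only OLTP test needs its table created first. A sketch of that step, assuming sysbench 0.4 option names (as in the loop below), its default database name sbtest, and an explicit --oltp-table-size=10000 to match the row count above; prepare_oltp_table is a hypothetical helper name.

```sh
#!/bin/sh
# Sketch of the table setup the read-only runs depend on (sysbench 0.4
# syntax). Assumes the default database "sbtest" and 10000 rows.
prepare_oltp_table() {
    host=$1
    # sysbench 0.4 does not create the database itself.
    mysql -h "$host" -u root -e 'CREATE DATABASE IF NOT EXISTS sbtest' || return 1
    # --mysql-table-engine matters here: it is used for CREATE TABLE.
    sysbench --test=oltp --db-driver=mysql --mysql-host="$host" \
        --mysql-user=root --mysql-table-engine=innodb \
        --oltp-table-size=10000 prepare
}

# Example: prepare_oltp_table "$HOST"
```
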

All but the necessary processes were killed on the two systems, so they
were running at most sshd, screen, sysbench and the minimum to be able to
log in. I did a warm-up run and then started testing:

# ${HOST} names the MySQL server; each run's report is appended to ${HOST}.txt
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
        echo "=> ${i} THREADS"
        sysbench --test=oltp --db-driver=mysql --mysql-host=${HOST} \
            --mysql-user=root --mysql-table-engine=innodb --num-threads=${i} \
            --max-time=60 --max-requests=0 --oltp-read-only=on run | \
            tee -a "${HOST}.txt"
done
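
To turn the accumulated ${HOST}.txt into threads vs. transactions-per-second pairs for plotting, something like the following awk sketch should do, assuming sysbench 0.4's report format (a "Number of threads: N" line and a "transactions: T (X per sec.)" line per run). The "=> N THREADS" markers from echo go to the terminal, not the file, so the sketch keys on sysbench's own thread-count line. summarise_sysbench is a hypothetical name.

```sh
#!/bin/sh
# Summarise sysbench 0.4 reports (files given as arguments, or stdin)
# into "threads tps" lines, one per run.
summarise_sysbench() {
    awk '
        /Number of threads:/ { threads = $NF }
        /^[ \t]*transactions:/ {
            # $3 looks like "(211.34"; drop the parenthesis.
            tps = $3
            sub(/^[(]/, "", tps)
            print threads, tps
        }
    ' "$@"
}

# Example: summarise_sysbench "${HOST}.txt" > "${HOST}.dat"
```
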

The two systems are connected via a 100Mbps switch. The sysbench host was
running NetBSD/i386 4.99.30 and has a dual-core CPU:

cpu0 at mainbus0 apid 0: (boot processor)
cpu0: Intel (686-class), 3200.24 MHz, id 0xf64
cpu0: features bfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features bfebfbff<PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX>
cpu0: features bfebfbff<FXSR,SSE,SSE2,SS,HTT,TM,SBF>
cpu0: features2 e4bd<SSE3,MONITOR,DS-CPL,VMX,EST,CID,xTPR,PDCM>
cpu0: features3 20100000<XD,EM64T>
cpu0: "Intel(R) Pentium(R) D CPU 3.20GHz"
cpu0: I-cache 12K uOp cache 8-way
cpu0: L2 cache 2 MB 64B/line 8-way
cpu0: ITLB 4K/4M: 128 entries
cpu0: DTLB 4K/4M: 64 entries