Subject: Re: 10M row test in that MySQL benchmark
To: Hubert Feyrer <hubert@feyrer.de>
From: Chuck Silvers <chuq@chuq.com>
List: tech-perform
Date: 02/24/2005 19:44:38
hi,

On Wed, Feb 09, 2005 at 10:16:07PM +0100, Hubert Feyrer wrote:
> 
> Re: http://software.newsforge.com/article.pl?sid=04/12/27/1243207&from=rss

I've been talking with the author of this article and looking into
the issues here a bit.


> I've talked to the author of the article, and he apparently really didn't 
> set PTHREAD_CONCURRENCY (because it isn't documented in NetBSD 
> 2.0); he wants to create some addendum, so we can look forward to an 
> update.

setting PTHREAD_CONCURRENCY=2 on our 2-CPU boxes actually reduces
the throughput of the super-smack select-key test slightly.
we go from being I/O-bound (CPUs about 40% busy) to being CPU-bound
(CPUs 100% busy, obviously), but with no benefit to the application.
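
for the record, the variable just needs to be in mysqld's environment
when it starts, something like this (the pkgsrc path and mysql user
here are assumptions, adjust for the local setup):

# give the thread library a hint to use two virtual processors
env PTHREAD_CONCURRENCY=2 /usr/pkg/bin/mysqld_safe --user=mysql &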

I ran the test under a profiling kernel, and here's the output:

PTHREAD_CONCURRENCY=1

167 p3:~ # time super-smack /usr/share/smacks/select-key.smack 10 10000
Query Barrel Report for client smacker1
connect: max=59ms  min=1ms avg= 15ms from 10 clients 
Query_type      num_queries     max_time        min_time        q_per_s
select_index    200000  0       0       9636.62
6.305u 7.354s 0:21.13 64.6%     0+0k 0+0io 0pf+0w

top during the run:
CPU0 states: 13.9% user,  0.0% nice, 18.8% system,  0.0% interrupt, 67.3% idle
CPU1 states: 29.7% user,  0.0% nice, 23.8% system,  0.0% interrupt, 46.5% idle

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
after startup:
 2397 mysql     18    0   324M   18M sigwai/0   0:00  1.15%  0.49% mysqld
after run:
 2397 mysql     18    0   324M   21M sigwai/0   0:10 18.04% 16.89% mysqld

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 72.47     40.04    40.04                             idle_loop
 10.95     46.09     6.05     7330   825.38   825.38  Xspllower
  2.50     47.47     1.38  3738039     0.37     1.14  soreceive
  1.45     48.27     0.80  6342632     0.13     0.13  _kernel_lock
  1.09     48.87     0.60  4153108     0.14     0.17  sodopendfree
  1.05     49.45     0.58  5172723     0.11     1.63  syscall_plain
  0.85     49.92     0.47  5805505     0.08     0.08  i386_copyin
  0.71     50.31     0.39  4413531     0.09     0.09  _kernel_lock_acquire_count
  0.70     50.70     0.39                             syscall1




PTHREAD_CONCURRENCY=2

165 p3:~ # time super-smack /usr/share/smacks/select-key.smack 10 10000
Query Barrel Report for client smacker1
connect: max=12ms  min=1ms avg= 7ms from 10 clients 
Query_type      num_queries     max_time        min_time        q_per_s
select_index    200000  1       0       9347.77
8.054u 14.502s 0:21.96 102.6%   0+0k 0+2io 0pf+0w

top during the run:
CPU0 states: 45.3% user,  0.0% nice, 45.3% system,  0.0% interrupt,  9.5% idle
CPU1 states: 33.8% user,  0.0% nice, 63.2% system,  0.0% interrupt,  3.0% idle

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
after startup:
 1976 mysql     18    0   324M   18M sigwai/0   0:03  0.00%  0.00% mysqld
after run:
 1976 mysql     18    0   324M   21M sigwai/0   1:06 18.32% 18.26% mysqld

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 55.13     23.80    23.80     2955     8.05     8.05  Xspllower
  7.25     26.93     3.13  6579165     0.00     0.00  _kernel_lock
  4.56     28.90     1.97  4355165     0.00     0.00  _kernel_lock_acquire_count
  4.01     30.63     1.73  3685574     0.00     0.00  soreceive
  2.90     31.88     1.25  4160068     0.00     0.00  sodopendfree
  2.48     32.95     1.07  5623640     0.00     0.00  i486_copyout
  2.36     33.97     1.02  4147578     0.00     0.01  syscall_plain
  1.92     34.80     0.83  5662118     0.00     0.00  i386_copyin
  1.67     35.52     0.72                             idle_loop
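
in case anyone wants to collect the same data: kernel profiles like
the ones above come from kgmon(8) run against a kernel built with
"config -p", roughly like this:

# reset and start the profiling counters, run the benchmark,
# then stop profiling and dump the data for gprof(1)
kgmon -r -b
super-smack /usr/share/smacks/select-key.smack 10 10000
kgmon -h -p
gprof /netbsd gmon.out | less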



unfortunately, most of the time is being spent with clock interrupts
blocked, and with the i386 implementation of splx() that means we can't
tell where it's all going.  I tried to make an inline-assembly version
of splx() a year or two ago, but it got ugly and I lost motivation.
anyone want to pick that up?



> A different question is why NetBSD performed so badly in the 10M rows test
> (http://www.newsforge.com/blob.pl?id=82311a9e7896a841032c395f270d6a0f).
> According to the article it's not a softdep or atime issue.


this one appears to be mostly due to our crappy page-replacement algorithm.
these sysctl settings more than doubled the throughput for tony:

vm.filemin=1
vm.filemax=20
vm.anonmin=80
vm.anonmax=99

this has the effect of keeping the mysql cache (anonymous memory)
resident instead of letting file pages push it out of memory.
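
to try this, set them at runtime with sysctl -w (as root); putting the
same name=value lines in /etc/sysctl.conf makes them persist across
reboots:

# bias page replacement toward keeping anonymous memory resident
sysctl -w vm.filemin=1
sysctl -w vm.filemax=20
sysctl -w vm.anonmin=80
sysctl -w vm.anonmax=99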



I tried a couple of additional changes to see if we could improve this further:

 (1) the loaning-for-read() stuff that we've discussed lately on tech-kern

 (2) another optimization that I haven't mentioned before, where we zero
     pages that are being overwritten with write() instead of reading them
     from disk.  (we have to do something to initialize the page before
     unlocking it so that other threads can't see the page's old contents.)
     (this only took about half an hour to implement, so it's basically a
     backup plan if loaning-for-write() turns out to be too hard or too
     risky for the 2.1 and 3.0 releases.)


I measured all four combinations of these two changes (domaploanobj
toggles loaning-for-read(), chuq_dozero toggles the page zeroing) for
both the "prepare" and "run" phases of sysbench.  the results were:

options                         prepare phase   run phase
                                (minutes)       (transactions/sec)
domaploanobj=0 chuq_dozero=0    52:40           5.56
domaploanobj=0 chuq_dozero=1    44:37           5.59
domaploanobj=1 chuq_dozero=0    52:57           6.84
domaploanobj=1 chuq_dozero=1    48:18           6.83


zeroing instead of reading for overwrites helps during prepare
and has no effect during run.  loaning-for-read(), on the other hand,
helps during run, but hurts during prepare.
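
for reference, a sysbench oltp run of this shape looks roughly like
the following (sysbench 0.4 syntax; the thread count here is a guess,
not necessarily what tony used):

# build the 10M-row test table, then run the transactional workload
sysbench --test=oltp --mysql-user=root --oltp-table-size=10000000 prepare
sysbench --test=oltp --mysql-user=root --oltp-table-size=10000000 \
    --num-threads=16 run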



I also ran this under a profiling kernel; here are the results:

prepare
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 44.13    263.91   263.91   475763     0.55     0.56  genfs_putpages
 13.68    345.74    81.83  2438342     0.03     0.03  i386_copyin
  7.11    388.24    42.50 379417126     0.00     0.00  pmap_clear_attrs
  5.19    419.29    31.05  1704374     0.02     0.02  memset
  2.74    435.68    16.39 380800608     0.00     0.00  pmap_tlb_shootnow
  2.40    450.04    14.36   121860     0.12     0.12  Xspllower
  2.28    463.69    13.65  1705310     0.01     0.01  i486_copyout
  0.97    469.51     5.82  8968996     0.00     0.00  lockmgr

run
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 34.52    370.95   370.95   709745     0.52     0.52  genfs_putpages
  9.74    475.64   104.69 15372348     0.01     0.01  i386_copyin
  8.30    564.79    89.15                             mpidle
  6.67    636.50    71.71 16511526     0.00     0.00  i486_copyout
  5.51    695.74    59.24 529689671     0.00     0.00  pmap_clear_attrs
  4.86    747.99    52.25   227478     0.23     0.23  Xspllower
  3.21    782.47    34.48  1941078     0.02     0.02  memset
  2.17    805.80    23.33 532201987     0.00     0.00  pmap_tlb_shootnow
  0.95    816.02    10.22  6513348     0.00     0.01  soreceive
  0.85    825.19     9.17 13135256     0.00     0.00  lockmgr
  0.84    834.24     9.05 18082701     0.00     0.00  uvm_pagelookup
  0.78    842.61     8.37  3175090     0.00     0.02  genfs_getpages
  0.69    850.04     7.43  3482626     0.00     0.12  uvm_fault
  0.59    856.38     6.34  1595193     0.00     0.01  pmap_remove_ptes
  0.50    861.74     5.36 13264280     0.00     0.03  syscall_plain
  0.49    866.98     5.24 14308126     0.00     0.00  pvtree_SPLAY_MINMAX
  0.48    872.18     5.20 23920121     0.00     0.00  pvtree_SPLAY


the database file's vnode has a bunch of pages in memory, and
genfs_putpages() has to look at all of them when mysqld does an fsync().
we're clearly hurt by our inability to do unbuffered I/O.



I also ran the super-smack select-key test with the same combinations
of the two optimizations, but all the runs had the same throughput.

the kernel profiling data for this benchmark was:

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 26.31      9.69     9.69     2384     4.06     4.06  Xspllower
 13.79     14.77     5.08  3800113     0.00     0.00  soreceive
  5.43     16.77     2.00  6006860     0.00     0.00  i386_copyin
  5.21     18.69     1.92  5404173     0.00     0.01  syscall_plain
  4.97     20.52     1.83  4200256     0.00     0.00  uipc_usrreq
  4.72     22.26     1.74  4400953     0.00     0.00  uiomove
  4.66     23.98     1.72                             syscall1
  3.31     25.20     1.22  4628997     0.00     0.00  i486_copyout
  3.10     26.34     1.14   400053     0.00     0.01  sosend
  2.86     27.39     1.05                             Xsyscall
  2.74     28.40     1.01  4200177     0.00     0.00  sodopendfree


again the i386 splx() gets in the way.  but since this time we're just
as interested in single-CPU performance, I ran it again on a macppc,
where the splx() implementation doesn't obscure the profile as much:

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 19.61      2.34     2.34 15445156     0.00     0.00  splraise
 19.28      4.64     2.30 15444019     0.00     0.00  splx
  9.72      5.80     1.16                             s_sctrap
  9.56      6.94     1.14      439     2.60     2.60  cpu_idlespin
  5.70      7.62     0.68  3800116     0.00     0.00  soreceive
  3.94      8.09     0.47  4200262     0.00     0.00  uipc_usrreq
  3.86      8.55     0.46  5405542     0.00     0.00  syscall_plain
  3.52      8.97     0.42  5219607     0.00     0.00  memcpy
  3.02      9.33     0.36  4609360     0.00     0.00  copyout
  2.68      9.65     0.32  4401865     0.00     0.00  uiomove
  2.26      9.92     0.27  3802081     0.00     0.00  dofileread
  1.76     10.13     0.21   400437     0.00     0.00  enable_fpu

                0.00    0.00    6779/15445156     uvm_lock_fpageq [92]
                0.03    0.00  200502/15445156     preempt [30]
                0.03    0.00  200969/15445156     wakeup <cycle 1> [42]
                0.03    0.00  201039/15445156     ltsleep <cycle 1> [35]
                0.15    0.00 1000108/15445156     sosend [14]
                0.18    0.00 1200108/15445156     m_gethdr [22]
                0.64    0.00 4200181/15445156     sodopendfree [10]
                1.27    0.00 8400357/15445156     soreceive [6]
[8]     19.6    2.34    0.00 15445156         splraise [8]

                0.00    0.00    2589/16282037     decr_intr [76]
                0.03    0.02  200502/16282037     preempt [30]
                0.15    0.08 1000108/16282037     sosend [14]
                0.18    0.09 1200108/16282037     m_gethdr [22]
                0.63    0.33 4200181/16282037     sodopendfree [10]
                1.26    0.65 8400357/16282037     soreceive [6]
[9]     19.3    2.30    0.00 15444019         splx <cycle 1> [9]
                                 546             do_pending_int <cycle 1> [185]


so it's not a definitive answer, but it's nearly certain that
the socket code is a major bottleneck for super-smack select-key.


-Chuck