Subject: more on mysql benchmark
To: None <tech-kern@netbsd.org>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 03/06/2005 17:38:09
more progress:  eventually I realized that I still had the mysql config file
set up wrong; after fixing that the results make a lot more sense.  previously
I was using the default mysql cache size (8 MB) instead of the desired 256 MB.
once all that memory was taken up by the mysql cache instead of being used for
caching the data file, the fsync() issue wasn't very significant anymore for
this benchmark.  also, with the tiny mysql cache, direct I/O actually cut the
throughput of the test in half; with the corrected cache size, direct I/O
performed the best of all the combinations, which is more in line with what
I expected.
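
(as an aside, the application-side pattern for direct I/O looks roughly like
the sketch below.  I'm assuming an O_DIRECT-style open flag and block-aligned
buffers here; the exact interface and alignment rules in the experimental
code may differ, and the file name and sizes are just made-up examples.)

/*
 * hypothetical sketch of application-side direct I/O:
 * open with an O_DIRECT-style flag and read into an aligned buffer.
 */
#include <sys/types.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <err.h>

int
main(void)
{
	void *buf;
	ssize_t n;
	int fd;

	/* made-up path, just for illustration. */
	fd = open("/var/mysql/test/data.ibd", O_RDONLY | O_DIRECT);
	if (fd == -1)
		err(1, "open");

	/* direct I/O generally wants the buffer aligned to the block size. */
	if (posix_memalign(&buf, 16 * 1024, 16 * 1024) != 0)
		errx(1, "posix_memalign");

	/* this read bypasses the page cache and goes straight to disk. */
	n = pread(fd, buf, 16 * 1024, 0);
	if (n == -1)
		err(1, "pread");

	free(buf);
	close(fd);
	return 0;
}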

the zero-instead-of-reading change started improving performance during the
"run" phase of the test as well, as I had originally expected.  then
I realized that the super-naive read-ahead algorithm we currently have
was causing us to read ahead 128 KB on 1/4 of the random reads we did, so
I tried turning that off, but that made things worse again, which didn't
make much sense.  then I realized there was another problem with read-ahead:
we currently only read the larger of one UBC window (8 KB by default) or
1 fs block (8 KB in my fs) at a time, but mysql is doing 16 KB reads.
so we were doing 2 synchronous I/Os instead of 1 for each read().
recreating my fs with a 16 KB block size improved things further, and once
I'd done that, disabling read-ahead helped as well.
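
to make the arithmetic concrete, here's a trivial standalone illustration of
the per-read() I/O count (the names here are made up for the example, not
taken from the actual UBC/genfs code):

#include <stdio.h>

#define MAX(a, b)	((a) > (b) ? (a) : (b))

/* number of synchronous I/Os needed to satisfy one read(). */
static int
nios(size_t readlen, size_t ubc_winsize, size_t fs_bsize)
{
	size_t chunk = MAX(ubc_winsize, fs_bsize);

	return (int)((readlen + chunk - 1) / chunk);
}

int
main(void)
{
	/* mysql does 16 KB reads. */
	printf("8 KB fs blocks:  %d I/Os per read()\n", nios(16384, 8192, 8192));
	printf("16 KB fs blocks: %d I/Os per read()\n", nios(16384, 8192, 16384));
	return 0;
}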

so starting with the default configuration of 8 KB bsize, default sysctl
settings and the bad mysql configuration, my progression of optimizations
has had the following effect:

change				mysql transactions per second
------				-----------------------------
start				about 3
better sysctl settings		5.30
fix mysql cache size		6.52
use 16 KB fs block size		9.82
zero-instead-of-read		10.22
disable read ahead		11.20
direct I/O + concurrency	11.96
increase mysql cache to 320 MB	12.39

(the last one seems appropriate with direct I/O... we're not using memory
for caching file data anymore, so we might as well use it for mysql's cache.)


after all these changes, the majority of the CPU overhead appears to be in
the socket code, as was the case with the super-smack select-key test.


in -current we already default to a 16 KB block size in sysinst, so that part
will be in 3.0.  (yes, we should also fix the code so that it doesn't matter.)
the only other one of these changes which seems safe to make at this point
would be to change the default sysctl settings.  the ones I suggested to tony
were a little extreme, but perhaps these would be good:


vm.filemin=5
vm.filemax=20
vm.anonmin=80
vm.anonmax=90
vm.execmin=5
vm.execmax=30


this is probably closer to what various people think is appropriate
in general anyway.  comments?


overall, this exercise has exposed a number of issues that we should address:

 - our default memory-usage balancing isn't good for this kind of application.
   it's unclear whether the current set of sysctl knobs is adequate.
 - the buffered I/O code needs to become smart enough to avoid reading pages
   when they're just going to be overwritten.
 - the read ahead code should avoid reading ahead until there is evidence of
   sequential access.  (a strawman sketch of one possible approach follows
   after this list.)
 - the read code should make sure that medium-size (more than a page, smaller
   than MAXPHYS) reads use the minimum number of disk I/Os (i.e. 1) to fetch
   the data.
 - the fsync() / vm-pager-put code needs to be able to find pages efficiently,
   which means tracking page dirtiness indexed by uvm_object.
 - we ought to implement direct I/O with something to allow concurrent reads
   and writes, since that performs best for database-like applications.


-Chuck