Subject: Re: direct I/O
To: Jonathan Stone <jonathan@dsg.stanford.edu>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 03/03/2005 17:28:26
On Tue, Mar 01, 2005 at 10:59:29AM -0800, Jonathan Stone wrote:
> In message <20050301103057.GE10259@spathi.chuq.com> Chuck Silvers writes
> >there appear to be two problems with the netbsd in the default configuration
> >in the 10M Rows benchmark:
> >
> > - our default settings for memory-usage balancing cause mysql's cache to
> >   be paged out.
> > - even with settings that prevent the mysql cache from being paged out,
> >   fsync() on a file with lots of pages in memory takes a lot of CPU time.
...
> Forgive me repeating the question, but: since FreeBSD does so much
> better on this benchmark, do we know what FreeBSD does differently?
> E.g., does mysql on FreeBSD automagically use O_DIRECT (with
> concurrent reads and writes?) or does it exercise a completely
> different codepath?

no, mysql uses buffered I/O by default on all platforms, and that is what
the benchmark article used.  the difference is that freebsd does not have
the two problems with buffered I/O that I mentioned above.

as for the first problem, I don't know how freebsd's page replacement scheme
works.  ours is known to not be very good.

as for the second problem, I looked at the profile data in a little more detail.
here's the profile data I posted earlier about the sysbench 10M Rows test
(which had the sysctl vm.* knobs set to avoid the first problem above):

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 34.52    370.95   370.95   709745     0.52     0.52  genfs_putpages
  9.74    475.64   104.69 15372348     0.01     0.01  i386_copyin
  8.30    564.79    89.15                             mpidle
  6.67    636.50    71.71 16511526     0.00     0.00  i486_copyout
  5.51    695.74    59.24 529689671     0.00     0.00  pmap_clear_attrs
  4.86    747.99    52.25   227478     0.23     0.23  Xspllower
  3.21    782.47    34.48  1941078     0.02     0.02  memset
  2.17    805.80    23.33 532201987     0.00     0.00  pmap_tlb_shootnow
  0.95    816.02    10.22  6513348     0.00     0.01  soreceive
  0.85    825.19     9.17 13135256     0.00     0.00  lockmgr
  0.84    834.24     9.05 18082701     0.00     0.00  uvm_pagelookup
  0.78    842.61     8.37  3175090     0.00     0.02  genfs_getpages
  0.69    850.04     7.43  3482626     0.00     0.12  uvm_fault
  0.59    856.38     6.34  1595193     0.00     0.01  pmap_remove_ptes
  0.50    861.74     5.36 13264280     0.00     0.03  syscall_plain
  0.49    866.98     5.24 14308126     0.00     0.00  pvtree_SPLAY_MINMAX
  0.48    872.18     5.20 23920121     0.00     0.00  pvtree_SPLAY


                              709745             VOP_PUTPAGES <cycle 4> [163]
[6]     34.7  370.95    1.52  709745         genfs_putpages <cycle 4> [6]
                0.72    0.00 1431958/18082701     uvm_pagelookup [60]
                0.29    0.34 1071761/2968820     uvm_pagefree [131]


                               17398             ffs_full_fsync <cycle 4> [320]
                1.67    0.33  156333/46616067     ffs_write [11]
                5.72    1.12  536002/46616067     uvmpd_scan_inactive [38]
[163]    0.1    0.15    0.86  709745         VOP_PUTPAGES <cycle 4> [163]
                0.32    0.54  709745/709745      ffs_putpages [177]
                              709745             genfs_putpages <cycle 4> [6]



we can't tell from this how much CPU time each of the three main callers of
VOP_PUTPAGES() (ffs_full_fsync(), ffs_write() and uvmpd_scan_inactive()) is
responsible for, so just to be sure, I made three copies of genfs_putpages()
and had the original function call a different copy depending on who called
it.  the results were:

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 33.94    371.10   371.10    17355    21.38    21.38  genfs_putpages_fsync
 10.96    490.94   119.84                             mpidle
  9.75    597.60   106.66 15507996     0.01     0.01  i386_copyin
  6.44    667.99    70.39 16654868     0.00     0.00  i486_copyout
  5.37    726.73    58.74 519864012     0.00     0.00  pmap_clear_attrs
  4.92    780.52    53.79   228223     0.24     0.24  Xspllower
  3.08    814.19    33.67  1925542     0.02     0.02  memset
...
  0.13   1002.59     1.38   157329     0.01     0.01  genfs_putpages_flushbehind
...
  0.06   1046.28     0.70   535569     0.00     0.00  genfs_putpages_pageout
...
  0.01   1088.65     0.11   710253     0.00     0.00  genfs_putpages




so the extra system CPU time is going into the page processing done on
behalf of fsync().
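
for the curious, the caller-split hack was nothing fancy; roughly the sketch
below, where the three copies are byte-for-byte duplicates of the original
genfs_putpages() body and the wrapper just picks one so that gprof charges
the time to the right caller.  the test used here to tell the callers apart
(a pagedaemon check plus the flags argument) is just one way to do it.

#include <sys/param.h>
#include <sys/proc.h>
#include <sys/vnode.h>
#include <uvm/uvm.h>

/* identical copies of the original genfs_putpages() body */
int genfs_putpages_fsync(void *);
int genfs_putpages_flushbehind(void *);
int genfs_putpages_pageout(void *);

int
genfs_putpages(void *v)
{
	struct vop_putpages_args /* {
		struct vnode *a_vp;
		voff_t a_offlo;
		voff_t a_offhi;
		int a_flags;
	} */ *ap = v;

	if (curproc == uvm.pagedaemon_proc)
		return genfs_putpages_pageout(v);	/* uvmpd_scan_inactive() */
	if (ap->a_flags & PGO_SYNCIO)
		return genfs_putpages_fsync(v);		/* ffs_full_fsync() */
	return genfs_putpages_flushbehind(v);		/* ffs_write() */
}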

I think freebsd doesn't use so much CPU time for fsync() because in their
world, fsync() doesn't look for dirty pages; it looks for dirty buffers.
I can't find the code that does this right now, but I vaguely recall that
dirty pages are tracked in their buffer cache by B_DELWRI buffers, so they
can find the dirty pages via the dirty buffers, which are kept on a separate
list from the clean buffers.
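
in other words, the work their fsync() does would be proportional to the
amount of dirty data rather than to the number of pages the file has in
memory.  very roughly, using the traditional 4.4BSD vnode fields since I
don't have their code in front of me (write_buffer()/write_page() are
made-up helpers):

#include <sys/param.h>
#include <sys/buf.h>
#include <sys/vnode.h>
#include <uvm/uvm.h>

/* made-up helpers, assumed to clean what they are given */
void write_buffer(struct buf *);	/* writes bp, moves it to v_cleanblkhd */
void write_page(struct vm_page *);	/* writes pg, marks it PG_CLEAN */

/*
 * with per-vnode clean/dirty buffer lists, fsync() only walks the
 * B_DELWRI buffers on v_dirtyblkhd, so the work is proportional to
 * the amount of dirty data.
 */
void
fsync_via_dirty_buffers(struct vnode *vp)
{
	struct buf *bp;

	while ((bp = LIST_FIRST(&vp->v_dirtyblkhd)) != NULL)
		write_buffer(bp);
}

/*
 * whereas finding the dirty pages by scanning the file's page cache
 * (which is what our fsync() path ends up doing in genfs_putpages())
 * costs time proportional to the number of pages in memory, dirty
 * or not.
 */
void
fsync_via_page_scan(struct vnode *vp)
{
	struct vm_page *pg;
	voff_t off;

	for (off = 0; off < vp->v_size; off += PAGE_SIZE) {
		pg = uvm_pagelookup(&vp->v_uobj, off);
		if (pg != NULL && (pg->flags & PG_CLEAN) == 0)
			write_page(pg);
	}
}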

however, as far as I can tell, linux does not track clean and dirty pages
separately in a way that lets you find the dirty pages without looking
through the clean ones, and last I heard, solaris didn't either.  yet both
of them did even better than freebsd in the article.


-Chuck