Subject: Re: direct I/O
To: Jonathan Stone <jonathan@dsg.stanford.edu>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 03/03/2005 17:28:26
On Tue, Mar 01, 2005 at 10:59:29AM -0800, Jonathan Stone wrote:
> In message <20050301103057.GE10259@spathi.chuq.com> Chuck Silvers writes
> >there appear to be two problems with netbsd in its default configuration
> >in the 10M Rows benchmark:
> >
> > - our default settings for memory-usage balancing cause mysql's cache to
> > be paged out.
> > - even with settings that prevent the mysql cache from being paged out,
> > fsync() on a file with lots of pages in memory takes a lot of CPU time.
...
> Forgive me repeating the question, but: since FreeBSD does so much
> better on this benchmark, do we know what FreeBSD does differently?
> E.g., does mysql on FreeBSD automagically use O_DIRECT (with
> concurrent reads and writes?) or does it exercise a completely
> different codepath?
no, mysql uses buffered I/O by default (and in the benchmark article)
on all platforms.  the difference is that freebsd does not have the two
problems with buffered I/O that I mentioned above.

as for the first problem, I don't know how freebsd's page replacement
scheme works.  ours is known not to be very good.

as for the second problem, I looked at the profile data in a little
more detail.  here's the profile data I posted earlier for the sysbench
10M Rows test (which had the sysctl vm.* knobs set to avoid the first
problem above):
  %   cumulative   self              self     total
 time   seconds   seconds     calls  ms/call  ms/call  name
34.52    370.95    370.95     709745     0.52     0.52  genfs_putpages
 9.74    475.64    104.69   15372348     0.01     0.01  i386_copyin
 8.30    564.79     89.15                               mpidle
 6.67    636.50     71.71   16511526     0.00     0.00  i486_copyout
 5.51    695.74     59.24  529689671     0.00     0.00  pmap_clear_attrs
 4.86    747.99     52.25     227478     0.23     0.23  Xspllower
 3.21    782.47     34.48    1941078     0.02     0.02  memset
 2.17    805.80     23.33  532201987     0.00     0.00  pmap_tlb_shootnow
 0.95    816.02     10.22    6513348     0.00     0.01  soreceive
 0.85    825.19      9.17   13135256     0.00     0.00  lockmgr
 0.84    834.24      9.05   18082701     0.00     0.00  uvm_pagelookup
 0.78    842.61      8.37    3175090     0.00     0.02  genfs_getpages
 0.69    850.04      7.43    3482626     0.00     0.12  uvm_fault
 0.59    856.38      6.34    1595193     0.00     0.01  pmap_remove_ptes
 0.50    861.74      5.36   13264280     0.00     0.03  syscall_plain
 0.49    866.98      5.24   14308126     0.00     0.00  pvtree_SPLAY_MINMAX
 0.48    872.18      5.20   23920121     0.00     0.00  pvtree_SPLAY
                                  709745             VOP_PUTPAGES <cycle 4> [163]
[6]     34.7  370.95    1.52      709745         genfs_putpages <cycle 4> [6]
                0.72    0.00  1431958/18082701      uvm_pagelookup [60]
                0.29    0.34  1071761/2968820       uvm_pagefree [131]
-----------------------------------------------
                                   17398             ffs_full_fsync <cycle 4> [320]
                1.67    0.33   156333/46616067      ffs_write [11]
                5.72    1.12   536002/46616067      uvmpd_scan_inactive [38]
[163]    0.1    0.15    0.86      709745         VOP_PUTPAGES <cycle 4> [163]
                0.32    0.54   709745/709745        ffs_putpages [177]
                                  709745             genfs_putpages <cycle 4> [6]
we can't tell from this how much CPU time each of the three main callers of
VOP_PUTPAGES() is responsible for, so just to be sure, I made three copies
of genfs_putpages() and changed the original function to call a different copy
depending on who called it. the results were:
  %   cumulative   self              self     total
 time   seconds   seconds     calls  ms/call  ms/call  name
33.94    371.10    371.10      17355    21.38    21.38  genfs_putpages_fsync
10.96    490.94    119.84                               mpidle
 9.75    597.60    106.66   15507996     0.01     0.01  i386_copyin
 6.44    667.99     70.39   16654868     0.00     0.00  i486_copyout
 5.37    726.73     58.74  519864012     0.00     0.00  pmap_clear_attrs
 4.92    780.52     53.79     228223     0.24     0.24  Xspllower
 3.08    814.19     33.67    1925542     0.02     0.02  memset
 ...
 0.13   1002.59      1.38     157329     0.01     0.01  genfs_putpages_flushbehind
 ...
 0.06   1046.28      0.70     535569     0.00     0.00  genfs_putpages_pageout
 ...
 0.01   1088.65      0.11     710253     0.00     0.00  genfs_putpages
so the extra system CPU time is being consumed by the fsync() processing.
I think that freebsd doesn't use so much CPU time for fsync() because in
their world, fsync() doesn't look for dirty pages; it looks for dirty
buffers.  I can't find the code that does this right now, but I vaguely
recall that dirty pages are tracked in their buffer cache by B_DELWRI
buffers, so the dirty pages can be found via the dirty buffers, which
are kept on a separate list from the clean buffers.
however, as far as I can tell, linux does not track clean and dirty pages
separately such that one can find the dirty pages without looking through
the clean ones, and last I heard, solaris didn't either. yet both of them
had even better performance than freebsd in the article.
-Chuck