Subject: Re: loaning for read() of regular files
To: Frank van der Linden <fvdl@netbsd.org>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 02/19/2005 09:47:31
On Tue, Feb 15, 2005 at 06:52:34PM +0100, Frank van der Linden wrote:
> This will probably also make the bonnie "rewrite" case perform much better.

I doubt it.  here's a little treatise on this topic that I wrote recently
in response to soda's query about my old mail discussing ways to optimize
writes to already-allocated parts of a file, as databases do.
I misunderstood him and thought he was talking about bonnie "rewrite",
which, as I describe below, is a different issue.

----------------------------------------------------------------------

sure, the database may be triggering read-modify-write, but as I recall,
the "rewrite" pattern is different.  ... and I just looked at the
bonnie code again to verify.

bonnie doesn't do random writes (which might trigger RMW internally).
it does an explicit read() and write() of each 16k chunk of the file
sequentially.  databases don't tend to do that, and there's no way to
avoid reading the old data in this case since we have to give it to
the application.  I think this pattern does especially badly because
the read-ahead and flush-behind disk I/Os interfere with each other.
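for concreteness, the rewrite loop is roughly this (a simplified
sketch, not bonnie's actual code):

#include <sys/types.h>
#include <unistd.h>

#define CHUNK	16384

/* simplified sketch of bonnie's "rewrite" pass: read each 16k chunk,
 * dirty it, seek back, and write it over the old data. */
void
rewrite_pass(int fd, off_t filesize)
{
	char buf[CHUNK];
	off_t off;

	for (off = 0; off < filesize; off += CHUNK) {
		if (read(fd, buf, CHUNK) != CHUNK)
			break;
		buf[0]++;				/* modify the chunk */
		if (lseek(fd, off, SEEK_SET) == -1)	/* back to its start */
			break;
		if (write(fd, buf, CHUNK) != CHUNK)
			break;
	}
}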

now if the database is only trying to do writes and not reads, and if
the write offsets are page-aligned, then we could safely buffer the
writes in the kernel without reading the old data if we knew that no one
had access to the pages via mmap().  if the write offsets are merely
sector-aligned, then we could still do that, but it would require some
more work (to track which parts of a page are valid).
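in pseudocode, the safety check for the page-aligned case would be
something like this (vnode_is_mmapped() is a made-up name for the
missing piece of infrastructure):

#include <stdbool.h>
#include <sys/types.h>

#define PAGE_SIZE	4096

/* hypothetical predicate: does anyone have the file mapped? */
static bool
vnode_is_mmapped(const void *vp)
{
	(void)vp;
	return true;	/* stub; answering this is the work to be done */
}

/* sketch of the decision to buffer a write without reading the old
 * data first: the write must cover whole pages, and no one may be
 * able to see the not-yet-valid pages through a mapping. */
bool
can_skip_read_before_write(const void *vp, off_t offset, size_t len)
{
	if ((offset & (PAGE_SIZE - 1)) != 0 ||
	    (len & (PAGE_SIZE - 1)) != 0)
		return false;
	return !vnode_is_mmapped(vp);
}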

but for databases (well, databases that do their own caching in user space,
which most databases do), it's almost always better to not buffer I/O in
the kernel at all and just send writes straight to the disk.  this can also
be done with sector-aligned file offsets, and it requires much of the same
infrastructure as the earlier optimization (knowing whether or not the file
is mmap()'d).
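for illustration, here's roughly what that looks like from the
application's side using the O_DIRECT open flag that some other
systems provide (the flag, the file name and the sector size are
just for the example; we'd need to grow an equivalent):

#define _GNU_SOURCE	/* for O_DIRECT on linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SECTOR	512

int
main(void)
{
	void *buf;
	int fd;

	/* O_DIRECT typically requires sector-aligned buffers,
	 * offsets and lengths */
	if (posix_memalign(&buf, SECTOR, SECTOR) != 0)
		return 1;
	memset(buf, 'x', SECTOR);

	fd = open("dbfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	if (fd == -1)
		return 1;

	/* the write bypasses the kernel's page cache entirely */
	if (pwrite(fd, buf, SECTOR, 0) != SECTOR)
		return 1;

	close(fd);
	free(buf);
	return 0;
}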



I tried an experiment just now to test my theory on why bonnie rewrite is
slow:  I ran bonnie against file systems with different mount options,
"softdep" and "async" respectively.  the important difference between these
mount options is that "softdep" (or actually any mount that's not "async")
will start writing dirty pages to disk asynchronously on 64k file offset
boundaries (known as "flush behind"), whereas "async" does not start
disk writes based on having dirtied any particular amount of memory.
in the "async" case, the pages are flushed to disk by the syncer thread.

here are the results:

bonnie on p3:/build/tmp (softdep)
              -------Sequential Output-------- ---Sequential Input-- --Random-- 
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks--- 
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU 
         5000 27241 58.2 28125 23.7  9767  9.4 32248 60.7 32771 21.4 176.9  1.8 
146.484u 185.530s 20:36.81 26.8%        0+0k 51+437io 0pf+1w


bonnie on p3:/build/tmp (async)
              -------Sequential Output-------- ---Sequential Input-- --Random-- 
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks--- 
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU 
         5000 17945 33.3 23326 15.4 15186 15.9 32842 61.8 33073 21.6 183.2  1.9 
145.408u 169.002s 19:41.33 26.6%        0+0k 50+789io 0pf+1w


I ran "iostat -x" while these were running, and the patterns during the
"rewrite" portion were as I expected:

softdep:
device  read KB/t    r/s   time     MB/s write KB/t    w/s   time     MB/s
...
wd0         64.00    170   1.00    10.64      64.00    171   1.00    10.71
wd0         64.00    173   1.00    10.83      64.00    173   1.00    10.83
wd0         64.00    170   1.00    10.64      64.00    169   1.00    10.58
wd0         64.00    167   1.00    10.46      64.00    167   1.00    10.46
wd0         64.00    177   1.00    11.08      64.00    178   1.00    11.14
wd0         64.00    170   1.00    10.64      64.00    169   1.00    10.58
wd0         64.00    167   1.01    10.46      64.00    167   1.01    10.46
wd0         64.00    170   0.97    10.64      64.00    170   0.97    10.64
...


async:
device  read KB/t    r/s   time     MB/s write KB/t    w/s   time     MB/s
...
wd0         64.00    238   1.00    14.85      63.32    255   1.00    15.80
wd0         64.00    581   1.01    36.32      16.00      1   1.01     0.02
wd0         63.88    502   0.87    31.31      16.00      1   0.87     0.01
wd0          0.00      0   0.80     0.00      64.00    406   0.80    25.37
wd0          0.00      0   0.98     0.00      63.75    509   0.98    31.68
wd0          0.00      0   1.01     0.00      64.00    259   1.01    16.22
wd0         64.00    402   1.00    25.12      62.84    164   1.00    10.09
wd0         64.00    571   1.00    35.71      16.00      1   1.00     0.02
wd0         64.00    377   0.67    23.58      64.00      6   0.67     0.37
wd0          0.00      0   1.00     0.00      64.00    531   1.00    33.17
wd0          0.00      0   0.99     0.00      63.88    517   0.99    32.24
wd0         63.73    239   1.01    14.85      63.84    300   1.01    18.70
wd0         64.00    565   1.00    35.33      16.00      2   1.00     0.03
...


so to improve the "rewrite" case, what we actually need to do is increase
the I/O sizes that we read ahead and flush behind (currently 64k for each)
so that we don't switch back and forth between reading and writing so often.
the iostat output above shows the cost of all that switching: with softdep,
the disk alternates between 64k reads and 64k writes and sustains only
about 10 MB/s in each direction, while with async the reads and writes
come in long separate bursts of up to ~35 MB/s.
and now that I think about it, this "rewrite" pattern is pretty similar to
that of a more common application: copying a file where the source and
destination are on the same physical disk.  so improving this case, while
not especially interesting in itself, will likely speed up copying files
as well.
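
a quick back-of-the-envelope illustration of the fix: the number of
times the disk has to switch between reading and writing per megabyte
rewritten scales inversely with the window size (the sizes beyond 64k
are hypothetical):

#include <stdio.h>

int
main(void)
{
	/* 64k is the current window size; the rest are hypothetical */
	const int windows_kb[] = { 64, 128, 256, 512 };
	size_t i;

	for (i = 0; i < sizeof(windows_kb) / sizeof(windows_kb[0]); i++)
		/* per window of file data the disk does one read burst
		 * and one write burst, i.e. two switches of direction */
		printf("%4dk windows: %2d read/write switches per MB\n",
		    windows_kb[i], 2 * 1024 / windows_kb[i]);
	return 0;
}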


----------------------------------------------------------------------


-Chuck