Subject: direct I/O
To: None <tech-kern@netbsd.org>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 02/28/2005 00:10:36
hi folks,

since my profiling experiments with the 10M Rows mysql benchmark showed that
avoiding read-before-overwrite didn't help much at all, this morning I decided
to go for the real solution and look into unbuffered (aka direct) I/O.
implementing this turned out to be much easier than I expected; it only
took a few hours to get it working.  the diff is at:

	ftp://ftp.netbsd.org/pub/NetBSD/misc/chs/diff.directio

(I need to add a check to prevent it from invalidating wired pages,
but that's the only thing that I've thought of that's missing.)

this is safer than the loaning-for-read()/write() stuff, since applications
must explicitly ask for the new behaviour and the interface to do this is
fairly standard already.  the interface is just a new open/fcntl flag,
O_DIRECT, which is a hint that I/O done via this file descriptor should be
performed without buffering in the kernel whenever possible.  there are a
number of reasons why it may not be possible to do an I/O without kernel
buffering (eg. because the request is not appropriately aligned on disk or in
memory), and in those cases the hint is ignored.
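
to make the alignment condition concrete, here is a rough userland sketch.
it is not taken from the diff: the names dio_eligible() and dio_enable() are
made up for illustration, and the alignment unit is passed in by the caller
because the real requirement depends on the device and the file system.
the first function only mirrors the kind of check that decides whether the
hint can be honoured, the second shows how an application might ask for the
behaviour on an already-open descriptor.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* Linux hides O_DIRECT without this */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/*
 * hypothetical helper: is this request a candidate for direct I/O?
 * the buffer address, the file offset and the length must all be
 * multiples of "align" (an assumed unit, eg. 512 for the sector size).
 */
bool
dio_eligible(const void *buf, off_t off, size_t len, size_t align)
{
	/* align must be a nonzero power of two */
	assert(align != 0 && (align & (align - 1)) == 0);

	if (len == 0 || off < 0)
		return false;
	return (uintptr_t)buf % align == 0 &&
	    (uintmax_t)off % align == 0 &&
	    len % align == 0;
}

/* hypothetical helper: turn the hint on via fcntl() */
int
dio_enable(int fd)
{
#ifdef O_DIRECT
	int fl = fcntl(fd, F_GETFL);

	if (fl == -1)
		return -1;
	return fcntl(fd, F_SETFL, fl | O_DIRECT);
#else
	/* platform has no O_DIRECT at all */
	(void)fd;
	errno = ENOTSUP;
	return -1;
#endif
}
```

an application would typically get its buffer from posix_memalign() and
keep offsets and lengths at multiples of the same unit.  since O_DIRECT is
only a hint, a request that fails the check is still serviced, just through
the normal buffered path.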

of course, with direct I/O being intrinsically synchronous, that immediately
leads to the question of how to allow concurrent reads and writes to a file.
our current locking scheme (enforced above the VOP layer) doesn't have any
concept of this.  I'm thinking about adding a range-locking implementation
and having vn_{read,write}() use that for O_DIRECT requests where
FOF_UPDATE_OFFSET is not set (ie. pread() and pwrite()).  this would involve
some new VOPs (VOP_LOCK_RANGE(), VOP_UNLOCK_RANGE()) and those would need to
interact appropriately with the existing non-ranged vnode locks.  file systems
that do not implement these VOPs would just return an error and the calling
layer would fall back on the current locking automatically.  there would
be no syscall interface change for this.  I'm thinking it probably won't be
much harder to do the range-locks than it was to do the direct I/O stuff.
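
as a strawman for the conflict rule such range locks might use, here is a
small single-threaded sketch.  everything in it is hypothetical (the rl_*
names, the fixed-size table, the try-lock semantics): a real implementation
would sleep instead of returning EBUSY and would have to cooperate with the
existing whole-vnode locks, none of which is shown here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define RL_MAX 16		/* arbitrary table size for the sketch */

struct rl_entry {
	off_t	start;		/* first byte covered */
	off_t	end;		/* one past the last byte covered */
	bool	exclusive;	/* true for a writer, false for a reader */
	bool	used;		/* slot currently holds a lock */
};

struct rl_table {
	struct rl_entry e[RL_MAX];	/* must start out zeroed */
};

/* do the half-open ranges [a_start, a_end) and [b_start, b_end) overlap? */
static bool
rl_overlap(off_t a_start, off_t a_end, off_t b_start, off_t b_end)
{
	return a_start < b_end && b_start < a_end;
}

/*
 * try to lock [start, end).  readers may share overlapping ranges,
 * a writer conflicts with anything it overlaps.
 * returns 0 on success, EBUSY on conflict, ENOMEM if the table is full.
 */
int
rl_trylock(struct rl_table *t, off_t start, off_t end, bool exclusive)
{
	struct rl_entry *slot = NULL;

	assert(start < end);
	for (size_t i = 0; i < RL_MAX; i++) {
		struct rl_entry *e = &t->e[i];

		if (!e->used) {
			if (slot == NULL)
				slot = e;	/* remember first free slot */
			continue;
		}
		if (rl_overlap(start, end, e->start, e->end) &&
		    (exclusive || e->exclusive))
			return EBUSY;
	}
	if (slot == NULL)
		return ENOMEM;
	*slot = (struct rl_entry){ start, end, exclusive, true };
	return 0;
}

/* release a range previously locked with exactly these bounds and mode */
int
rl_unlock(struct rl_table *t, off_t start, off_t end, bool exclusive)
{
	for (size_t i = 0; i < RL_MAX; i++) {
		struct rl_entry *e = &t->e[i];

		if (e->used && e->start == start && e->end == end &&
		    e->exclusive == exclusive) {
			e->used = false;
			return 0;
		}
	}
	return ENOENT;
}
```

with this rule two pwrite() calls to disjoint ranges of the same file can
both hold their locks at once, while a read that overlaps a write in
progress has to wait (here: gets EBUSY).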

comments?

-Chuck