Subject: Re: direct I/O
To: Darrin B.Jewell <dbj@netbsd.org>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 03/03/2005 17:37:35

On Thu, Mar 03, 2005 at 07:03:10PM -0500, Darrin B.Jewell wrote:
> 
> Chuck Silvers <chuq@chuq.com> writes:
> > of course, with direct I/O being intrinsically synchronous, that immediately
> > leads to the question of how to allow concurrent reads and writes to a file.
> > our current locking scheme (enforced above the VOP layer) doesn't have any
> > concept of this.  I'm thinking about adding a range-locking implementation
> > and having vn_{read,write}() use that for O_DIRECT requests where
> > FOF_UPDATE_OFFSET is not set (ie. pread() and pwrite()).  this would involve
> > some new VOPs (VOP_LOCK_RANGE(), VOP_UNLOCK_RANGE()) and those would need to
> > interact appropriately with the existing non-ranged vnode locks.  file systems
> > that do not implement these VOPs would just return an error and the calling
> > layer would fall back on the current locking automatically.  there would
> > be no syscall interface change for this.  I'm thinking it probably won't be
> > much harder to do the range-locks than it was to do the direct I/O stuff.
> > 
> > comments?
> 
> My first thought is that for internal kernel use, a separate
> VOP_LOCK_RANGE call should not be necessary.  Instead, I was thinking
> that the existing PG_BUSY locks could be relied on for individual
> pages in a range.  I don't think range locking is necessary at this
> level, since consistent results for concurrent access to a range is
> not required as far as I am aware.  (Except in the case of
> fcntl(F_SETLK), which is handled separately.)
> 
> In fact, I would like to relax vnode locks, so that they don't lock
> out concurrent access to a file, and instead just protect the
> integrity of filesystem metadata where necessary.

POSIX requires that for regular files, read() and write() be atomic with
respect to each other (and write()s with respect to other write()s),
so we do need to keep enough locking to provide that.  however, there is no
requirement that non-overlapping write()s be serialized, so tracking
which ranges are being read or written allows greater concurrency than
the current scheme of just tracking read vs. write.
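
as a rough sketch of the conflict rule this implies (every name below is
made up for illustration; none of it is an existing or proposed kernel
interface, and a real implementation would also need sleeping, wakeups and
interaction with the whole-vnode lock): two requests conflict only if their
byte ranges overlap and at least one of them is a write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not an existing NetBSD API. */
enum rl_mode { RL_READ, RL_WRITE };

struct rl_range {
	uint64_t	start;	/* first byte covered */
	uint64_t	end;	/* one past the last byte covered */
	enum rl_mode	mode;
};

/* Half-open ranges [start, end) overlap iff each starts before the other ends. */
bool
rl_overlaps(const struct rl_range *a, const struct rl_range *b)
{
	return a->start < b->end && b->start < a->end;
}

/* Overlapping ranges conflict unless both are reads. */
bool
rl_conflicts(const struct rl_range *a, const struct rl_range *b)
{
	if (!rl_overlaps(a, b))
		return false;
	return a->mode == RL_WRITE || b->mode == RL_WRITE;
}

/*
 * Return true if "req" could be granted immediately given the ranges
 * currently held.  A linear scan is used only to keep the sketch short.
 */
bool
rl_can_grant(const struct rl_range *held, size_t nheld,
    const struct rl_range *req)
{
	size_t i;

	assert(req->start < req->end);	/* empty ranges are not meaningful */
	for (i = 0; i < nheld; i++) {
		if (rl_conflicts(&held[i], req))
			return false;
	}
	return true;
}
```

with a write held on [0, 4096) and a read held on [8192, 12288), a new
write to [4096, 8192) or a new read inside [8192, 12288) could proceed at
once, while a write inside the held read range or a read inside the held
write range would have to wait.  under the current read-vs-write scheme the
non-overlapping write would have to wait as well.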

(see my response to bill's mail for more details on our fs locking.)


> I'll also note that currently, i/o operations on the VCHR device are
> already allowed to be concurrent because the vnode is unlocked in
> spec_write when the underlying device write routine is called.  While
> I think this is appropriate behavior, this leads to an existing race
> condition in the physio uvm_vslock/uvm_vsunlock code which causes
> diagnostic messages to be spewed from the i386 pmap_unwire routine.
> If useful, I can reproduce a test case and submit a pr.

there is no requirement that reads and writes on devices be serialized,
and VOP_LOCK() for device nodes is already a no-op.  I'm aware of the
problems with uvm_vslock(), but fixing them is probably going to be
somewhat involved.

-Chuck