tech-kern: Re: direct I/O

Subject: Re: direct I/O
To: Bill Studenmund <wrstuden@netbsd.org>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 03/03/2005 17:30:48

On Thu, Mar 03, 2005 at 01:27:43PM -0800, Bill Studenmund wrote:
> On Mon, Feb 28, 2005 at 12:10:36AM -0800, Chuck Silvers wrote:
> > 
> > of course, with direct I/O being intrinsically synchronous, that immediately
> > leads to the question of how to allow concurrent reads and writes to a file.
> > our current locking scheme (enforced above the VOP layer) doesn't have any
> > concept of this.  I'm thinking about adding a range-locking implementation
> > and having vn_{read,write}() use that for O_DIRECT requests where
> > FOF_UPDATE_OFFSET is not set (ie. pread() and pwrite()).  this would involve
> > some new VOPs (VOP_LOCK_RANGE(), VOP_UNLOCK_RANGE()) and those would need to
> > interact appropriately with the existing non-ranged vnode locks.  file systems
> > that do not implement these VOPs would just return an error and the calling
> > layer would fall back on the current locking automatically.  there would
> > be no syscall interface change for this.  I'm thinking it probably won't be
> > much harder to do the range-locks than it was to do the direct I/O stuff.
> 
> I think I'd rather push this into the file system. I agree with the 
> discussions we've had that in the long run it'd be nice to move to a 
> different vnode locking scheme, and as part of that VOP_READ() and 
> VOP_WRITE() calls would be performed w/o holding the vnode lock. In that 
> case, the fs has to do all the locking internally. But I think it'd be 
> easy to handle something like you describe here: on entry, VOP_READ() or 
> VOP_WRITE() grabs some sort of range lock for its i/o, performs it, then 
> releases the range lock. For READ, we let the locks be shared and for 
> WRITE, exclusive.

yea, moving all the fs locking under the VOP layer would be good in the
long run.  I was trying to keep these direct-I/O-related changes as
low-impact as possible since I want to backport them to 2.x (and now 3.x,
since I guess I'm too late for 3.0 itself).  doing the locking inside the
file systems would affect more places, but I'd think it would still be
manageable for backporting.
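
to make the shared-for-READ, exclusive-for-WRITE idea concrete, here is a
minimal userland sketch of a byte-range lock built on pthreads.  it is not
kernel code and not a proposed interface: struct range_lock,
range_acquire(), range_release() and the fixed-size table are all made up
for illustration, and there is no fairness, so a steady stream of readers
could starve a writer.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <sys/types.h>

#define RL_MAX_HELD 16          /* hypothetical fixed table size */

struct rl_entry {
    off_t start, end;           /* byte range [start, end) */
    bool  exclusive;            /* true for writes, false for reads */
    bool  used;
};

struct range_lock {
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    struct rl_entry held[RL_MAX_HELD];
};

static void
range_lock_init(struct range_lock *rl)
{
    pthread_mutex_init(&rl->mtx, NULL);
    pthread_cond_init(&rl->cv, NULL);
    for (int i = 0; i < RL_MAX_HELD; i++)
        rl->held[i].used = false;
}

/* return a free slot, or -1 if the range conflicts or the table is full */
static int
rl_find_slot(const struct range_lock *rl, off_t start, off_t end,
    bool exclusive)
{
    int free_slot = -1;

    for (int i = 0; i < RL_MAX_HELD; i++) {
        const struct rl_entry *e = &rl->held[i];

        if (!e->used) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        /* overlapping ranges conflict unless both are shared */
        if (start < e->end && e->start < end &&
            (exclusive || e->exclusive))
            return -1;
    }
    return free_slot;
}

/* returns a slot handle >= 0, or -1 if !wait and the lock is unavailable */
static int
range_acquire(struct range_lock *rl, off_t start, off_t end,
    bool exclusive, bool wait)
{
    int slot;

    assert(start < end);
    pthread_mutex_lock(&rl->mtx);
    while ((slot = rl_find_slot(rl, start, end, exclusive)) < 0 && wait)
        pthread_cond_wait(&rl->cv, &rl->mtx);
    if (slot >= 0)
        rl->held[slot] = (struct rl_entry){ start, end, exclusive, true };
    pthread_mutex_unlock(&rl->mtx);
    return slot;
}

static void
range_release(struct range_lock *rl, int slot)
{
    pthread_mutex_lock(&rl->mtx);
    rl->held[slot].used = false;
    pthread_cond_broadcast(&rl->cv);    /* wake anyone waiting on a range */
    pthread_mutex_unlock(&rl->mtx);
}
```

in this sketch a read path would call range_acquire() with exclusive set
to false around its i/o and a write path would pass true, releasing the
slot when the i/o completes.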

as long as you're only talking about changing the locking rules for
VOP_READ() and VOP_WRITE() and not all the other VOPs, that's ok with
me.  a brief look shows that VOP_READDIR() and VOP_READLINK() are
sometimes implemented with VOP_READ(), so it would probably make sense
to change those at the same time, and it wouldn't be much more effort.


> The problem with VOP_LOCK_RANGE is, at least as it pops into my head, that 
> the file system still needs to do locking internally to protect metadata 
> structures. Before that was protected by the vnode lock, but now (with 
> either option), we can have multiple writes operating on the same file at 
> once. So the fs will have to be savvy. Consider the case of two pwrite()s 
> to a sparse file, with each one of them allocating blocks. We need to make 
> sure both of those operations work right. :-)

actually, the way things are today, a file's block map is protected by
the getpages lock, not the VOP_LOCK().  this was part of the UBC rework,
since faults on a mapping need to read the block map (and modify it
for write faults to holes), and we can't acquire the VOP_LOCK() in this
path since we might already be holding it (or that of another vnode).
for allocating writes, the direct I/O code I posted earlier just punts
back to the buffered-write code.  so locking the block map isn't an issue
for direct I/O.
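
the "punt allocating writes to the buffered path" decision can be sketched
like this.  this is only an illustration, not the code from the posted
patch: struct toy_file, its flat block map (0 meaning a hole) and both
function names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum write_path { WRITE_DIRECT, WRITE_BUFFERED };

/* hypothetical stand-in for a file's block map */
struct toy_file {
    const uint32_t *blkmap;     /* disk block per file block, 0 = hole */
    size_t          nblocks;    /* number of entries in blkmap */
    size_t          blksize;    /* file system block size in bytes */
};

/* true if writing [off, off + len) would have to allocate a block */
static bool
write_needs_allocation(const struct toy_file *f, uint64_t off, uint64_t len)
{
    assert(f->blksize != 0);
    if (len == 0)
        return false;

    uint64_t first = off / f->blksize;
    uint64_t last = (off + len - 1) / f->blksize;

    if (last >= f->nblocks)     /* write extends past the mapped blocks */
        return true;
    for (uint64_t b = first; b <= last; b++) {
        if (f->blkmap[b] == 0)  /* write touches a hole */
            return true;
    }
    return false;
}

/* allocating writes go through the buffered path, the rest go direct */
static enum write_path
choose_write_path(const struct toy_file *f, uint64_t off, uint64_t len)
{
    return write_needs_allocation(f, off, len) ?
        WRITE_BUFFERED : WRITE_DIRECT;
}
```

with this split, the direct path only ever reads the block map, so it never
needs to modify it.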


> deal with two write operations happening at once, we can just let 
> VOP_WRITE() not need a lock and thus we won't need the _RANGE calls. :-)

we can't ever do away with locking entirely for VOP_READ() and VOP_WRITE(),
since POSIX requires that for regular files, read() and write() be atomic
with respect to each other (and write() with respect to other write()s).

-Chuck