Subject: Re: direct I/O
To: Chuck Silvers <chuq@chuq.com>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 03/04/2005 08:31:19

On Thu, Mar 03, 2005 at 05:30:48PM -0800, Chuck Silvers wrote:
> On Thu, Mar 03, 2005 at 01:27:43PM -0800, Bill Studenmund wrote:
> > I think I'd rather push this into the file system. I agree with the
> > discussions we've had that in the long run it'd be nice to move to a
> > different vnode locking scheme, and as part of that VOP_READ() and
> > VOP_WRITE() calls would be performed w/o holding the vnode lock. In that
> > case, the fs has to do all the locking internally. But I think it'd be
> > easy to handle something like you describe here: on entry, VOP_READ() or
> > VOP_WRITE() grabs some sort of range lock for its i/o, performs it, then
> > releases the range lock. For READ, we let the locks be shared and for
> > WRITE, exclusive.
>
> yea, moving all the fs locking under the VOP layer would be good in the
> long run.  I was trying to keep these direct-I/O-related changes as
> low-impact as possible since I want to backport them to 2.x (and now 3.x,
> since I guess I'm too late for 3.0 itself).  doing it this way would affect
> more places, but I'd think it would still be manageable for backporting.

Ahhh... That makes sense. For the back-porting case, I think the _RANGE()
calls are fine.
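
Just to sketch the shape I have in mind, here's roughly what VOP_WRITE()
could look like in a given fs. The "rangelock" type, the functions, the
RL_* flags, and the i_rangelock field are all made-up names for
illustration; nothing like this exists in the tree today:

	/*
	 * Illustrative sketch only -- rangelock_enter(), rangelock_exit(),
	 * RL_READER/RL_WRITER and i_rangelock are invented names.
	 */
	int
	examplefs_write(void *v)
	{
		struct vop_write_args /* {
			struct vnode *a_vp;
			struct uio *a_uio;
			int a_ioflag;
			struct ucred *a_cred;
		} */ *ap = v;
		struct vnode *vp = ap->a_vp;
		struct uio *uio = ap->a_uio;
		void *rl;
		int error;

		/* exclusive lock over just the bytes being written */
		rl = rangelock_enter(&VTOI(vp)->i_rangelock,
		    uio->uio_offset, uio->uio_resid, RL_WRITER);

		error = examplefs_dowrite(vp, uio, ap->a_ioflag, ap->a_cred);

		rangelock_exit(rl);
		return (error);
	}

examplefs_read() would do the same with RL_READER, so concurrent readers
don't block each other.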

> as long as you're only talking about changing the locking rules for
> VOP_READ() and VOP_WRITE() and not all the other VOPs, that's ok with
> me.  a brief look shows that VOP_READDIR() and VOP_READLINK() are
> sometimes implemented with VOP_READ(), so it would probably make sense
> to change those at the same time, and it wouldn't be much more effort.

Ok. Eventually I'd like all the ops to change, but that's a lot of work I
don't have time to do now. :-)

> > The problem with VOP_LOCK_RANGE is, at least as it pops into my head, that
> > the file system still needs to do locking internally to protect metadata
> > structures. Before that was protected by the vnode lock, but now (with
> > either option), we can have multiple writes operating on the same file at
> > once. So the fs will have to be savvy. Consider the case of two pwrite()s
> > to a sparse file, with each one of them allocating blocks. We need to make
> > sure both of those operations work right. :-)
>
> actually, the way things are today, a file's block map is protected by
> the getpages lock, not the VOP_LOCK().  this was part of the UBC rework,
> since faults on a mapping need to read the block map (and modify it
> for write faults to holes), and we can't acquire the VOP_LOCK in this
> path since we might already be holding it (or that of another vnode).
> for allocating writes, the direct I/O code I posted earlier just punts
> back to the buffered-write code.  so locking the block map isn't an issue.

Ok! That's good to know.
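
If I'm reading the punt right, the decision in your direct-write path
amounts to something like the following (this is just my paraphrase with
invented helper names, not your actual patch):

	/* Sketch only -- the helper names are invented. */
	if (uio->uio_rw == UIO_WRITE &&
	    write_would_allocate(vp, uio->uio_offset, uio->uio_resid)) {
		/* writing into a hole (or extending): use the buffered path */
		return examplefs_buffered_write(vp, uio, ioflag, cred);
	}
	/* range is fully allocated, so the block map won't change under us */
	return examplefs_direct_write(vp, uio, ioflag, cred);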

> > deal with two write operations happening at once, we can just let
> > VOP_WRITE() not need a lock and thus we won't need the _RANGE calls. :-)
>
> we can't ever do away with locking entirely for VOP_READ and VOP_WRITE,
> since POSIX requires that, for a given file, read() and write() must be atomic
> with respect to each other (and write() with respect to other write()s).

I'm sorry, I meant "not need a[n external] lock"; there won't be a need
for an external (vnode) lock. I agree we will need internal locking.

I also expect we will need some sort of locking on the file descriptor,
since offset-based accesses (read(2) and write(2), as opposed to pread(2)
and pwrite(2)) have to be serialized.
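
I.e. something along these lines in the read(2)/write(2) path (all of the
names here are hypothetical, just to show where the serialization has to
happen):

	fd_lock(fp);				/* per-descriptor, not per-vnode */
	offset = fp->f_offset;			/* shared implicit offset */
	error = do_transfer(fp, &offset, uio);	/* ends up in VOP_READ/VOP_WRITE */
	fp->f_offset = offset;			/* advance before the next read(2) */
	fd_unlock(fp);

pread(2) and pwrite(2) supply their own offset, so they can skip this and
only need the internal (range) locking in the fs.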

Take care,

Bill
