Subject: Re: CVS commit: src
To: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 06/24/2004 12:15:18

On Thu, Jun 24, 2004 at 09:35:59AM +0900, YAMAMOTO Takashi wrote:
> > Whatever access callers of the file system expected to be serialized that
> > now aren't. i.e. a case where a caller called VOP_LOCK() and expected an
> > exclusive lock is now in place. Especially if it expected that lock to be
> > held across a call to ltsleep().
>
> yes, we have such callers, unfortunately.
> they should be fixed eventually, IMO.

If we say, "We do X," how exactly are callers broken to assume that we
actually _do_ X? Ok, I get that you wish it were otherwise. And I also get
that things aren't well documented, so it's not necessarily clear that
we're saying, "We do X." But there is a method to what we do and, for the
most part, it is followed.

"Fixed," means (to me) things are "broken." I'm not seeing "broken" in
this thread. "Different" from what we may like, and different from what we
might do were we to start again. But I'm not getting what validates the
judgement to get us to "broken."

My big concern is that if we aren't clear in our goals for changing
things, then we will end up in a muddle, and a muddle in the file system
code will be a killer. Paraphrasing a bit, I'm hearing (from a number of
folks, not just Yamamoto-san), "I don't like Y," or "Z should be
different." I'm not hearing, "I want to do X, but can't because of Y so I
want to change Y to Z." Those comments don't sound like the right way to
start out changing things.

> > Also, for things like delete and rename, would it be so easy? Or file
> > creation?
>
> yes, it's easy.  see nfs client.

You and Gordon missed my point. I'm not saying that it will always be
hard; I'm sorry if I implied it would be. I agree that something like NFS,
where we send an RPC off to a different server, can handle this easily.

However it _can_ be hard for the fs. Consider a local file system. A
delete involves looking up the name, then deleting the entry. If some other
operation comes in and changes the directory in the middle (between
operations), then we can't simply proceed with the deletion. We need to
redo the lookup.

So we either lock the involved node(s) between the lookup and delete, or
we let the delete code cope with a potential change between the lookup and
when it starts, or we rig up a way the delete code returns a, "restart the
whole sequence," error code. Then whatever does the delete releases all
the vnodes, and restarts (redoes the lookup then redoes the delete call,
like a RAS).
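
To make the third option concrete, here is a minimal user-space sketch of a
"restart the whole sequence" protocol. It is not NetBSD code. The toy
directory, the generation counter, and every name in it (toy_dir, toy_hit,
toy_lookup, toy_delete, toy_remove, TOY_ERESTART) are assumptions made up
for illustration, and a concurrent change is only simulated by bumping the
generation by hand.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical toy directory, for illustration only. */
#define TOY_NENT     4
#define TOY_ERESTART (-1000)  /* made-up "restart the whole sequence" code */

struct toy_dir {
    unsigned gen;                 /* bumped on every modification */
    char     name[TOY_NENT][16];  /* empty string marks a free slot */
};

struct toy_hit {
    int      slot;  /* slot the lookup matched */
    unsigned gen;   /* directory generation the lookup saw */
};

/* Lookup: record the matching slot and the generation at that moment. */
static int toy_lookup(const struct toy_dir *d, const char *name,
                      struct toy_hit *h)
{
    for (int i = 0; i < TOY_NENT; i++) {
        if (d->name[i][0] != '\0' && strcmp(d->name[i], name) == 0) {
            h->slot = i;
            h->gen = d->gen;
            return 0;
        }
    }
    return ENOENT;
}

/* Delete: refuse to act on a lookup result that has gone stale. */
static int toy_delete(struct toy_dir *d, const struct toy_hit *h)
{
    if (h->gen != d->gen)
        return TOY_ERESTART;  /* directory changed since the lookup */
    assert(d->name[h->slot][0] != '\0');
    d->name[h->slot][0] = '\0';
    d->gen++;
    return 0;
}

/* Caller: redo lookup + delete until the pair runs undisturbed. */
static int toy_remove(struct toy_dir *d, const char *name)
{
    struct toy_hit h;
    int error;

    do {
        error = toy_lookup(d, name, &h);
        if (error != 0)
            return error;
        error = toy_delete(d, &h);
    } while (error == TOY_ERESTART);
    return error;
}
```

Note that nothing bounds the number of trips around the loop in toy_remove()
if other threads keep modifying the directory.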

For leaf file systems, redoing the lookup would be a pain. It would also
be extra code for a case that doesn't happen much. Restarting the call is
not deterministic, and, for something as heavy as node deletion, it is
really not what you want to do in a busy environment; it's much better to
have threads sleep for some lock thus acting sequentially. So locking
between the lookup and the delete is a win. It's also what we do now. :-)

So the next question is, if we do the lookup but not the delete, how do we
clean up? We have to undo the locking somehow. Berkeley decided to make
locking visible to the callers of the VOP routines, and to make them clean
up the nodes and locks. Thus our current locking methodology.
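
A rough user-space sketch of what "locking visible to the callers" means in
practice: the lookup returns with the lock held, and the caller either
carries on to the delete or undoes the locking itself. This is an
illustration only and does not reproduce the real vnode interface. The
pthread mutex stands in for the vnode lock, and all names (cl_dir,
cl_lookup, cl_abort, cl_delete, cl_remove_unless) and the "held" flag are
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical toy directory node; stands in for a locked vnode. */
#define CL_NENT 4

struct cl_dir {
    pthread_mutex_t lock;
    int             held;             /* 1 while lock is held (illustration) */
    char            name[CL_NENT][16];
};

/*
 * On success: returns 0 with d->lock HELD and *slot set.
 * On failure: returns ENOENT with the lock already released.
 */
static int cl_lookup(struct cl_dir *d, const char *name, int *slot)
{
    pthread_mutex_lock(&d->lock);
    d->held = 1;
    for (int i = 0; i < CL_NENT; i++) {
        if (d->name[i][0] != '\0' && strcmp(d->name[i], name) == 0) {
            *slot = i;
            return 0;  /* lock stays held for the caller */
        }
    }
    d->held = 0;
    pthread_mutex_unlock(&d->lock);
    return ENOENT;
}

/* The caller did the lookup but not the delete: it cleans up the lock. */
static void cl_abort(struct cl_dir *d)
{
    assert(d->held);
    d->held = 0;
    pthread_mutex_unlock(&d->lock);
}

/* The delete consumes the lock that the lookup took. */
static void cl_delete(struct cl_dir *d, int slot)
{
    assert(d->held);
    d->name[slot][0] = '\0';
    d->held = 0;
    pthread_mutex_unlock(&d->lock);
}

/* Example caller: look up, then either delete or back out. */
static int cl_remove_unless(struct cl_dir *d, const char *name, int veto)
{
    int slot, error;

    error = cl_lookup(d, name, &slot);
    if (error != 0)
        return error;
    if (veto) {
        cl_abort(d);  /* lookup done, delete not done: caller undoes it */
        return EPERM;
    }
    cl_delete(d, slot);
    return 0;
}
```

The cost of this style is visible in cl_remove_unless(): every exit path in
the caller has to know the lock state.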

> > > From VOP_LOCK(9)
> > >     VOP_LOCK() is used to serialise access to the
> > >     file system such as to prevent two writes to
> > >     the same file from happening at the same time.
> > >
> > > Why? Is this a semantic of the file model? Or is
> > > this a context to make things "easier" on the
> > > underlying file system?
> >
> > It is file system semantics. A call to write(2), barring errors, is
> > supposed to be atomic. Thus if you have two write calls that overlap, the
> > overlapping data are to have come from one call or the other, not some mix
> > of both.
>
> because we use a single VOP_WRITE for write(2),
> there's no need to expose vnode lock to upper layer for this reason.

The "need" for this for VOP_WRITE() is a design decision. For things like
deleting and renaming, as above, we need some sort of locking. And
VOP_LOOKUP() has to be in on it too. The question is, do we use the same
locking for VOP_WRITE(), or do we do some other sort of locking. If we did
some other sort of locking (a lock different from the lookup/delete/
create/rename lock), then we could have VOP_WRITE() do all its locking
internally.

Berkeley decided not to do that. And to be honest, their choice seems
quite sensible. Just have one lock, and let the callers help out with
locking operations.

For VOP_WRITE, I don't think it would really make that much difference.
Right now, the caller does a VOP_LOCK(), some stuff which doesn't look to
take that long (cycles), VOP_WRITE(), some small amount of stuff, then
VOP_UNLOCK().

If we didn't do the locking externally, we pretty much lock as soon as the
write starts, and unlock before exiting. Seems about the same to me. Yes,
we spend a few cycles fewer w/o the lock, but I don't think that, in and
of itself, is much of a difference.
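
Side by side, the two shapes look roughly like the following user-space
sketch. It is only meant to show where the lock and unlock calls sit in
each variant. The wf_* names and the in-memory "file" are made up, and a
pthread mutex stands in for the vnode lock.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical in-memory file guarded by a single lock. */
struct wf_file {
    pthread_mutex_t lock;
    char            data[32];
};

/* The write proper; assumes the caller already holds f->lock. */
static void wf_write_locked(struct wf_file *f, size_t off,
                            const char *buf, size_t len)
{
    assert(off <= sizeof(f->data) && len <= sizeof(f->data) - off);
    memcpy(f->data + off, buf, len);
}

/* External locking: the caller brackets the operation itself. */
static void wf_caller_write(struct wf_file *f, size_t off,
                            const char *buf, size_t len)
{
    pthread_mutex_lock(&f->lock);       /* stand-in for VOP_LOCK() */
    /* ... a little caller-side bookkeeping would go here ... */
    wf_write_locked(f, off, buf, len);  /* stand-in for VOP_WRITE() */
    /* ... and a little more here ... */
    pthread_mutex_unlock(&f->lock);     /* stand-in for VOP_UNLOCK() */
}

/* Internal locking: the operation locks on entry, unlocks on exit. */
static void wf_write_selflocking(struct wf_file *f, size_t off,
                                 const char *buf, size_t len)
{
    pthread_mutex_lock(&f->lock);
    wf_write_locked(f, off, buf, len);
    pthread_mutex_unlock(&f->lock);
}
```

In both variants the copy itself runs entirely under the lock, so two
overlapping writes end up with the overlap coming from one call or the
other. The only difference is how much caller code sits inside the locked
region.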

Take care,

Bill
