Subject: RE: CVS commit: src
To: Bill Studenmund <>
From: Gordon Waidhofer <>
List: tech-kern
Date: 06/24/2004 14:55:47
There is mostly loud agreement. Allow me to
loudly agree some more (mostly)...

Bill Studenmund wrote
> On Thu, Jun 24, 2004 at 09:35:59AM +0900, YAMAMOTO Takashi wrote:
> > > Whatever access callers of the file system expected to be serialized that 
> > > now aren't. i.e. a case where a caller called VOP_LOCK() and expected an 
> > > exclusive lock is now in place. Especially if it expected that lock to be 
> > > held across a call to ltsleep().
> > 
> > yes, we have such callers, unfortunately.
> > they should be fixed eventually, IMO.
> If we say, "We do X," how exactly are callers broken to assume that we 
> actually _do_ X? Ok, I get that you wish it were otherwise. And I also get 
> that things aren't well documented, so it's not necessarily clear that 
> we're saying, "We do X." But there is a method to what we do and, for the 
> most part, it is followed.

X isn't broadly known. For example, the description of VOP_LOCK()
says that it provides exclusion to prevent simultaneous ops. For
a file system that has no such need, one could infer that
VOP_LOCK() could be a noop. But that's not the case. Further,
one could construe that VOP_LOCK() need not use lockmgr() nor
operate on v_lock. But that's not the case. The reasons why
VOP_LOCK() has to work a certain way, X, isn't published.
> "Fixed," means (to me) things are "broken." I'm not seeing "broken" in
> this thread. "Different" form what we may like, and different from what we
> might do were we to start again. But I'm not getting what validates the
> judgement to get us to "broken."

I agree. It isn't broken.
> My big concern is that if we aren't clear in our goals for changing
> things, then we will end up in a muddle, and a muddle in the file system
> code will be a killer. Paraphrasing a bit, I'm hearing (from a number of
> folks, not just Yamamoto-san), "I don't like Y," or "Z should be
> different." I'm not hearing, "I want to do X, but can't because of Y so I
> want to change Y to Z." Those comments don't sound like the right way to
> start out changing things.

That sounds exactly right. Add to it, though, that because
certain aspects of the file model are not particularly visible,
it feels like what you are worried about happening has already
happened. I'm sure it hasn't.

VOP_LOCK() at one time, perhaps, was totally at the
discretion of the underlying file system. Through organic
evolution, now it really needs to be lockmgr()-ish. Is
that muddled? Probably not. It's just not particularly
visible. Is that murky?

> > > Also, for things like delete and rename, would it be so easy? Or file 
> > > creation?
> > 
> > yes, it's easy.  see nfs client.
> You and Gordon missed my point. I'm not saying that it will always be
> hard; I'm sorry if I implied it would be. I agree that something like NFS,
> where we send an RPC off to a different server, can handle this easily.
> However it _can_ be hard for the fs. Consider a local file system. A
> delete involves lookup up the name, then deleting the entry. If some other
> operation comes in and changes the directory in the middle (between
> operations), then we can't simply proceed with the deletion. We need to
> redo the lookup.

Well, see, that's blaming the underlying file system. Perhaps
originally VOP_LOCK() was optional and for file systems on
which LOOKUP/REMOVE sequences are "hard" it could work just
the way you say, and on file systems that don't have it "hard"
VOP_LOCK() could be a noop.

But VOP_LOCK() can't be a noop. It ain't the file systems' fault.
It something else: X.

> So we either lock the involved node(s) between the lookup and delete, or
> we let the delete code cope with a potential change between the lookup and
> when it starts, or we rig up a way the delete code returns a, "restart the 
> whole sequence," error code. Then whatever does the delete releases all 
> the vnodes, and restarts (redoes the lookup then redoes the delete call, 
> like a RAS). 
> For leaf file systems, redoing the lookup would be a pain. It would also
> be extra code for a case that doesn't happen much. Restarting the call is
> not deterministic, and, for something as heavy as node deletion, it is
> really not what you want to do in a busy environment; it's much better to
> have threads sleep for some lock thus acting sequentially. So locking
> between the lookup and the delete is a win. It's also what we do now. :-)

	$ man -k leaf
	leaf: nothing appropriate

I'll take your word for it, though.

> So the next question is, if we do the lookup but not the delete, how do we 
> clean up? We have to undo the locking somehow. Berkeley decided to make 
> locking visible to the callers of the VOP routines, and to make them clean 
> up the nodes and locks. Thus our current locking methodology.

Not all file systems work that way. Clean up what?
If LOOKUP finds an entry, then REMOVE does not, the
error is still ENOENT.

> > > > >From VOP_LOCK(9)
> > > >     VOP_LOCK() is used to serialise access to the
> > > >     file system such as to present two writes to
> > > >     the same file from happening at the same time.
> > > > 
> > > > Why? Is this a semantic of the file model? Or is
> > > > this a context to make things "easier" on the
> > > > underlying file system?
> > > 
> > > It is file system semantics. A call to write(2), barring errors, is 
> > > supposed to be atomic. Thus if you have two write calls that overlap, the 
> > > overlapping data are to have come from one call or the other, not some mix 
> > > of both.
> > 
> > because we use a single VOP_WRITE for write(2),
> > there's no need to expose vnode lock to upper layer for this reason.
> The "need" for this for VOP_WRITE() is a design decision. For things like
> deleting and renaming, as above, we need some sort of locking. And
> VOP_LOOKUP() has to be in on it too. The question is, do we use the same
> locking for VOP_WRITE(), or do we do some other sort of locking. If we did
> some other sort of locking (a lock different from the lookup/delete/
> create/rename lock), then we could have VOP_WRITE() do all its locking
> internally.

Perhaps. It's an FFS-centric argument. That's fair, valid, reasonable,
sensible, practical, etc. FFS is the primary file system. But the
argument, "FFS needs it so all file systems need it" won't hold.
FFS needs it. OK. It may lack an unimportant aesthetic. OK, too.
Constrainst on the VOP design and on VOP callers for the sake
of FFS really is OK. It is the dog wagging the tail.

But it can be confusing and can stymie other interesting
file system work. Mostly because VOP descriptions talk
abstractly, then reasonable inferences from those descriptions
do not hold.

> Berkeley decided not to do that. And to be honest, their choice seems
> quite sensible. Just have one lock, and let the callers help out with 
> locking operations.
> For VOP_WRITE, I don't think it would really make that much difference. 
> Right now, the caller does a VOP_LOCK(), some stuff which doesn't look to 
> take that long (cycles), VOP_WRITE(), some small amount of stuff, then 
> If we didn't do the locking externally, we pretty much lock as soon as the 
> write starts, and unlock before exiting. Seems about the same to me. Yes, 
> we spend a few cycles fewer w/o the lock, but I don't think that, in and 
> of itself, is much of a difference.

Probably it isn't. But, consider this...

Say 100 lookups on a particular directory enter the
system at the same time. Suppose 50 of them will have
to wait for I/O, and 50 won't. There are approaches to
file systems (like mine) that will satisfy the 50 that
need no I/O before the first I/O is done. But, VOP_LOCK()
assures that something will have to wait for 50 I/Os it
doesn't care about.

Now, consider NFS. 100 lookups could be issued to the
server before the first reply is received. Because of
VOP_LOCK(), we are assured that the 100th lookup will
wait 99 roundtrip latencies.

It's not a practical example, of course.

Nothing is broken. But X, whatever it is, does seem
to be in the way of those sorts of scenarios. For us
poor add-on folks who can't live and breath this stuff,
it would sure be more comfortable to know about X.

In the end, knowning that VOP_LOCK() is mandatory
and lockmgr()-ish got me through what I needed to
do without ever understanding fully. That's what
interested me in this thread. It seemed like the
reasons why I can't overlap lookups, fer instance,
were about to become visible. But, alas...

And, again, nothing is broken.

> Take care,
> Bill