Subject: Re: CVS commit: src/sys
To: Bill Stouder-Studenmund <wrstuden@netbsd.org>
From: Andrew Doran <ad@netbsd.org>
List: tech-kern
Date: 07/31/2007 11:38:22
On Mon, Jul 30, 2007 at 04:01:27PM -0700, Bill Stouder-Studenmund wrote:

> > I think this is a good stopgap for fixing NetBSD 4, but also that it's the
> > wrong way to solve the problem. The issue is that there are two threads
> > trying to deactivate the vnode, and eventually one goes on to reclaim. That
> > can be prevented by continuing to hold a reference to the vnode across the
> > call to deactivate it. Once it's deactivated (and we've checked it doesn't
> > need to be re-deactivated due to a ref in the meantime) we should go on to
> > lock it and reclaim it. The premise being that no-one should try to lock a
> > vnode without first holding a reference to it.
> 
> Do you have references describing this issue? I'm not sure where it'd 
> occur.

See the traceback here. The syncer is trying to lock the vnode in order to
deactivate it. Meanwhile lwp_0 also deactivates it, barges ahead and
reclaims. Both threads believe that they hold the last reference:

    http://mail-index.netbsd.org/current-users/2007/07/23/0027.html
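
To make that concrete, the ordering I have in mind is roughly the following
(a simplified sketch, not the actual vrele() code; vnode_drop_ref() is an
invented helper):

	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
	VOP_INACTIVE(vp, l);		/* deactivate; unlocks vp */
	if (vnode_drop_ref(vp)) {	/* hypothetical: count hit zero? */
		/* no new reference appeared meanwhile; reclaim */
		vgone(vp);
	}

Holding our reference across VOP_INACTIVE() means no second thread can see
a zero count and barge ahead to reclaim underneath us.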

> The layering issue is that we export the struct lock to upper layer file 
> systems. This is in part an efficiency concern and part an implementation 
> (a simplistic implementation) of Heidemann's caching ideas (see his 
> dissertation, which I think I have as a .ps file if you need it).

This one: "A Layered Approach to File System Development"?
 
> > On the other hand, maybe it's time we looked at pushing the locking back
> > into the file systems and being more strict about the reference counting, so
> > that these kinds of issues don't continue to appear. :-). What do people
> > think of the idea? Some of the issues involved in doing that I know of are..
> 
> I used to dislike the idea, but I now like it. I think it's the right way 
> to go in the long run. Well, I think we may still need 
> VOP_LOCK()/VOP_UNLOCK(), but they'd be much more like the SVR4 versions - 
> you only use them when you need to make an atomic sequence of a number of 
> VOPs. Most uses just do all locking internally.
>
> > o VOP_LOCK and VOP_UNLOCK would need to go away. To make VOP calls you'd
> >   only need a reference to the vnode.
> 
> See above. I mostly agree.

My first thought is that file systems would (simplistically) do something
like this at the start of each VOP:

	rw_enter(&ip->i_lock, RW_READER);
	if (ip->i_gone) {
		/* revoked, but there are still references */
		rw_exit(&ip->i_lock);
		return ENOENT;
	}

We could make those genfs methods if the fs doesn't want to do something
more clever. To make holding the lock across multiple VOPs something that
can be done generally, we'd need to allow recursion or similar, right? Or
are you thinking of something different?
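
If we did want to hold the gate across several VOPs, it would need to
remember its owner so that re-entry from the same LWP passes straight
through; very roughly, with an invented i_gateholder field:

	/*
	 * Sketch only: i_gateholder would be set by a thread that has
	 * taken the gate exclusively in order to span multiple VOPs.
	 */
	if (ip->i_gateholder != curlwp) {
		rw_enter(&ip->i_lock, RW_READER);
		if (ip->i_gone) {
			rw_exit(&ip->i_lock);
			return ENOENT;
		}
	}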

> > o We'd need a VOP_RWLOCK or VOP_RANGELOCK for concurrent reads/writes.
> 
> Nah, just have the FS do this internally when an operation comes in.

That occurred to me too - agreed.
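
That is, each file system would take its own data lock on the way in;
roughly, with an invented i_datalock field and a hypothetical internal
helper:

	int
	foo_read(void *v)
	{
		struct vop_read_args *ap = v;
		struct inode *ip = VTOI(ap->a_vp);
		int error;

		rw_enter(&ip->i_datalock, RW_READER);
		error = foo_read_internal(ap);
		rw_exit(&ip->i_datalock);

		return error;
	}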
 
> > o The calls to VOP_LEASE look suspicious, since they are made and then
> >   followed by other VOP calls under the same lock. VOP_LEASE doesn't appear
> >   to do anything these days though.
> 
> I think we've dropped NQNFS support, so it doesn't. We should check that 
> NFSv4 implementations don't need it, though.

I'll investigate.
 
> > o There are some places where VOP_ACCESS is followed by another call into
> >   the file system. Those checks would need to go into the file system.
> 
> Not necessarily.
> 
> I think the real test is to compare the # of call points that have 
> VOP_ACCESS() now with the number of internal calls we'd have if we pushed 
> it in. # callers now vs. # file system types * # VOPs that need these 
> checks internally. Note I'm assuming the FSs don't implicitly still do 
> VOP_ACCESS() checks anyway now.

They're not as widespread as I thought. The two main offenders seem to be
MNT_SYMPERM and the following. I'll look more thoroughly when I get a
moment.

        else if ((error = vn_writechk(vp)) == 0 &&
            (error = VOP_ACCESS(vp, VWRITE, l->l_cred, l)) == 0) {
                VATTR_NULL(&vattr);
                vattr.va_size = SCARG(uap, length);
                error = VOP_SETATTR(vp, &vattr, l->l_cred, l);
        }
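
Pushed into the file system, that would presumably become part of
foo_setattr() itself; something like this sketch:

	/* in foo_setattr(), when the size is being changed */
	if (vap->va_size != VNOVAL) {
		if ((error = vn_writechk(vp)) != 0 ||
		    (error = VOP_ACCESS(vp, VWRITE, cred, l)) != 0)
			return error;
	}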

> > o To eliminate v_interlock's part in locking a vnode, it would be difficult
> >   to be sure that we have completely reclaimed vnodes from a file system -
> >   because we can't be absolutely sure at which point calls stop being made
> >   into foo_vnops and start being made into dead_vnops. That would mean:
> >   after a forced unmount the file system code would need to stick around
> >   (not be unloaded) while there are still vnodes that it spawned. I guess we
> >   could add a reference count to the vnode to count the number of LWPs
> >   currently within a VOP and poll until the figure drops to zero for all the
> >   remaining vnodes. That's a lot of additional atomic ops for the system to
> >   do, though.
> 
> I don't see how v_interlock matters here. I agree though that we need to 
> do something to deal with killing concurrent VOPs when wiping out a vnode.

I think I worded that badly. The issue I see is that access to vp->v_op
would only be gated by vget()/vrele(). So you have thread 1 doing the
following during revoke:

	vp->v_op = dead_vnodeop_p;

Thread 2 still has a reference to the vnode and does a VOP_FOO() which
translates into something like:

	func = vp->v_op[operation_number];
	(*func)(a, b, c, d);

There isn't a cheap and easy way to be sure whether thread 2 is going to
call foo_read or dead_read. Between the two statements it could be preempted
and not run again for 10 seconds, or it could hang around in foo_read for a
while. That's why I said that if filesystem X spawns a vnode, you wouldn't
be able to modunload it until all of those vnodes had completely died - it
would need to stick around to reject the requests.

It occurred to me that we could turn things on their head and go through the
list of open files / mounts and gate things through FILE_USE() and similar,
so that we can safely rip the vnode reference out from under whoever is
using it. That seems very ugly.

> Draining on some count may be what we'd have to do...

If we got the memory ordering right we could certainly do something like the
following and wake up periodically to poll. It could cost hundreds more
cycles per VOP, which is why I'm keen to avoid it.

VOP_FOO(struct vnode *vp, ...)
{

	atomic_inc_uint(&vp->v_op->vop_usecount);
	/* ... do it ... */
	atomic_dec_uint(&vp->v_op->vop_usecount);
}
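
The revoke/unmount side would then drain on the count, something like this
sketch (vops being the file system's vnodeop vector; we'd still need the
right memory barriers around the count):

	/* wait for LWPs still inside foo_vnops to finish */
	while (vops->vop_usecount > 0)
		kpause("vopdrain", false, hz, NULL);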

Maybe this is all too much effort. Forced unmounts are really useful, but is
forced modunload a feature of general interest?

Thanks,
Andrew