Re: rototilling the vnode life cycle



   Date: Fri, 28 Feb 2014 15:02:41 +0100
   From: J. Hannken-Illjes <hannken%eis.cs.tu-bs.de@localhost>

   On Feb 27, 2014, at 7:16 PM, Taylor R Campbell 
<campbell+netbsd-tech-kern%mumble.net@localhost> wrote:

   I'm currently short on time but here are some comments and remarks.

Thanks!

   First, using an explicit vnode state, v_interlock to inspect or modify,
   looks good.

Actually, I think I may want to change that, so that vget can be done
under a pserialized read -- i.e., atomics only, no locks or waits.  I
took a look at dholland's message from September and started to write
a generic inode->vnode cache with pserialize, and realized vget needs
to happen under it.  But at least the explicit states will help to
clarify the model for now, and will be a good fallback if I give up
before making a pserialized inode->vnode cache.
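
To make that concrete, here is a minimal sketch of what an
atomics-only vget could look like under a pserialize read section.
(The name vget_pserialized is made up for illustration, and it only
handles vnodes that still have a nonzero usecount; anything else
would fall back to a slower path under the interlock.)

int
vget_pserialized(struct vnode *vp)
{
        unsigned int uc;

        /* Caller is inside a pserialize read section: no locks, no waits. */
        do {
                uc = vp->v_usecount;
                if (uc == 0)
                        return ENOENT;  /* not active: take the slow path */
                if (uc == UINT_MAX)
                        return EBUSY;   /* refuse to overflow the usecount */
        } while (atomic_cas_uint(&vp->v_usecount, uc, (uc + 1)) != uc);

        return 0;
}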

   > 1. Revocation and destruction are confused.  Reclamation is the only
   > thing common to both operations.  Deadfs is for revocation; there is
   > no reason to use it for destruction.

   I suppose you mean using it for the short time after VOP_RECLAIM()
   and before vn_free().  It is still better than using the operations
   vector of the file system it has been removed from or using a NULL
   vector.

There's no reason for anyone else to even see the vnode after
VOP_RECLAIM returns.  If that's a possibility, it's a serious bug, so
panicking or trapping on a null pointer is reasonable.

   > 3. I can't prove there aren't races in deactivation and reclamation
   > because the code paths are so byzantine, and it's not clear what
   > zombie cases file systems have to handle.  (See, e.g., the bizarre
   > conditional call to VOP_INACTIVE just before VOP_RECLAIM in vclean.)

   Bizarre?  "Always close the vnode unless it is an active block device
   that is mounted" is quite reasonable; what would you expect from a
   mounted file system that lost its device?

There were too many conditionals and possible code paths leading there
for me to disentangle them.  What was bizarre to me was that sometimes
vfs would call VOP_UNLOCK, and sometimes vfs would call VOP_INACTIVE
and ignore the file system's decision of whether to keep it around.
Anyway, that wasn't the crux of the problem with vclean.

   > Instead, once we decide to destroy a vnode, it should be final, and
   > file systems shouldn't have to handle races in making and effecting
   > that decision.

   That's the way it works now.  With the one exception cited above, a vnode
   reaching vclean will be destroyed (maybe becoming a dead node if it is
   active).

vclean is confusing because it does (some subset of) four different
things:

1. flush buffers and do vfs/uvm accounting,
2. ask the file system to destroy its state,
3. change the vnode's identity into a deadfs vnode, and/or
4. free the vnode.

Moreover, there may be more than one call in progress simultaneously,
and if you run it twice in a row it is expected to be idempotent, which
means the callers are confused.  So I wanted to disentangle these into
separate routines, each used exclusively and only when its callers are
actually sure they want it.
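
For illustration (with placeholder names, not what the patch actually
calls them), the decomposition I have in mind looks like this:

void vnode_flush(struct vnode *vp);      /* 1. flush buffers, vfs/uvm accounting */
int  vnode_reclaim(struct vnode *vp);    /* 2. VOP_RECLAIM: fs destroys its state */
void vnode_to_deadfs(struct vnode *vp);  /* 3. change identity to a deadfs vnode */
void vnode_free(struct vnode *vp);       /* 4. free the vnode */

/* Revocation composes 1, 2, and 3; destruction composes 1, 2, and 4. */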

   >  * Vnodes exist in one of six states:
   >  *
   >  * - UNINITIALIZED
   >  * - ACTIVE
   >  * - REVOKING
   >  * - DEACTIVATING
   >  * - INACTIVE
   >  * - DESTROYING

   Is there a dead state missing?

Nope.  After DESTROYING it gets freed.  The protocol is this:

1. When vfs decides to destroy a vnode, it enters DESTROYING.
2. During this time, vget returns ENOENT for it.
3. VOP_RECLAIM unhooks it from all data structures (e.g., ufs_ihash).
4. Once VOP_RECLAIM returns, nobody else will even try to vget it.
5. vfs vnfrees the vnode.

That said, I made vnfree change its state to UNINITIALIZED so that
anything trying to use-after-free will crash.
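
Roughly, in code (placeholder names; the real patch differs in
detail):

static void
vnode_destroy(struct vnode *vp)
{

        /* 1. Decide to destroy it: enter DESTROYING under the interlock. */
        mutex_enter(vp->v_interlock);
        KASSERT(vp->v_usecount == 0);
        vp->v_state = VS_DESTROYING;    /* 2. vget now returns ENOENT */
        mutex_exit(vp->v_interlock);

        /* 3. The file system unhooks it from its data structures
         * (e.g. ufs_ihash), so (4) nobody will even try to vget it. */
        (void)VOP_RECLAIM(vp);

        /* 5. Free it; vnfree resets the state to UNINITIALIZED so a
         * use-after-free trips the panic in vget instead of silently
         * appearing to work. */
        vnfree(vp);
}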

   >  *
   >  * To get a file for a vnode on disk, a file system will do this:
   >  *
   >  * 1. Look it up in inode table; if there, try vget and return.
   >  * 2. If not there or if vget fails, getnewvnode and allocate data
   >  *    structures needed by the file system.
   >  * 3. Look it up again in inode table; if there, try vget,
   >  *    ungetnewvnode, free data structures, and return.
   >  * 4. If not there or if vget fails, initialize the new vnode, call
   >  *    vnode_ready, and store it in the inode table.

   This could be done better -- see David Holland's "fixing the vnode
   lifecycle" from Sep 22, 2013.

Well, this is the right structure.  As is, we write (something close
to) this for each file system; we can move a single copy of it into
what dholland called the vncache, which will do basically the same
thing.
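
In other words, each file system currently open-codes something like
the following (a sketch with placeholder names, examplefs_* and
VT_EXAMPLEFS, and with some error handling trimmed); the vncache would
be one shared copy of exactly this dance:

static int
examplefs_getvnode(struct mount *mp, ino_t ino, struct vnode **vpp)
{
        struct vnode *ovp, *vp;
        int error;

        /* 1. Look it up in the inode table; if there, try vget. */
        ovp = examplefs_ihash_lookup(mp, ino);
        if (ovp != NULL && vget(ovp, 0) == 0) {
                *vpp = ovp;
                return 0;
        }

        /* 2. Not there, or vget failed: get a fresh vnode and allocate
         * the file system's data structures for it. */
        error = getnewvnode(VT_EXAMPLEFS, mp, examplefs_vnodeop_p, NULL, &vp);
        if (error)
                return error;
        examplefs_alloc_inode(vp, ino);

        /* 3. Look it up again; somebody may have beaten us to it. */
        ovp = examplefs_ihash_lookup(mp, ino);
        if (ovp != NULL && vget(ovp, 0) == 0) {
                examplefs_free_inode(vp);
                ungetnewvnode(vp);
                *vpp = ovp;
                return 0;
        }

        /* 4. Initialize the new vnode, mark it ready, and publish it. */
        examplefs_init_inode(vp, ino);
        vnode_ready(vp);
        examplefs_ihash_insert(mp, ino, vp);
        *vpp = vp;
        return 0;
}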

   >  * If you have an ACTIVE vnode, you can revoke access to it with
   >  * vrevoke, usually called by VOP_REVOKE.  This causes it to enter the
   >  * REVOKING state temporarily while the file system's VOP_RECLAIM is
   >  * called, and then it changes identity to an ACTIVE vnode in deadfs.

   Better use an explicit DEAD state here.

Adding a separate state for revoked vnodes isn't actually necessary.
They otherwise behave just like normal vnodes in a funny file system,
and the active/inactive/destroying/&c. states all make sense for them.

   > int
   > vget(struct vnode *vp, int flags)
   > ...
   >    case VS_UNINITIALIZED:
   >            vnpanic(vp, "vget uninitialized vnode");

   As it is on the mount list this looks wrong.

I intended that you shouldn't touch anything on the mountlist for
which vnode_ready_p returns false.  Perhaps there are applications
that need to wait for the transition from UNINITIALIZED to anything
else, in which case we'd just need to add a cv_broadcast to
vnode_ready, although I think it would be better if we deferred
putting vnodes on the mount list until they're actually ready.
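
Something like this is what I had in mind for mount list consumers
(a sketch only; examplefs_scan_vnodes is a made-up consumer and the
usual mntvnode_lock dance is elided):

static void
examplefs_scan_vnodes(struct mount *mp)
{
        struct vnode *vp;

        TAILQ_FOREACH(vp, &mp->mnt_vnodelist, v_mntvnodes) {
                mutex_enter(vp->v_interlock);
                if (!vnode_ready_p(vp)) {
                        /* UNINITIALIZED: not ours to touch yet. */
                        mutex_exit(vp->v_interlock);
                        continue;
                }
                /* ... inspect or vget the vnode here ... */
                mutex_exit(vp->v_interlock);
        }
}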

   > 
   >    case VS_ACTIVE: {
   >            unsigned int uc;
   > 
   >            do {
   >                    uc = vp->v_usecount;
   >                    KASSERT(0 < uc);
   >                    if (uc == UINT_MAX)
   >                            return EBUSY;
   >            } while (atomic_cas_uint(&vp->v_usecount, uc, (uc + 1)) != uc);

   A simple atomic_add should be sufficient -- we hold the interlock here
   so the state cannot change.

I wanted to ensure it returns failure in case of overflow instead of
panicking (or silently proceeding, as it always does currently).

   >    case VS_REVOKING:
   >    case VS_DESTROYING:
   >            return ENOENT;

   Return without waiting looks bad.

I think this is actually OK.  If you're a file system, and you're
trying to grab one of your own vnodes, it's no good to get one whose
identity will soon be changed to deadfs -- you need to make a new
vnode for the inode you were looking for.  Nor is it any good to try
to work with one that vfs has chosen to destroy -- it's too late to
recover that one.  So in both cases immediately returning ENOENT is
the right thing.

   > void
   > vrele(struct vnode *vp)
   > {
   > 
   >    vrele_with(vp, &vnode_deactivate);

   Take v_interlock.

vrele_with takes it.

I omitted the definition of atomic_dec_uint_lock_if_zero, but
basically it atomically decrements values of 2 or more and returns
true, or takes the lock and atomically decrements 1 to 0 and returns
false.  So if you're not the last user, all you do is drop the
usecount.  If you are the last user, you get the interlock and you
drop the usecount to 0.

It's equivalent to the vtryrele logic we have now, but packaged up
more nicely.
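
For reference, here is a sketch of the helper as I described it (the
definition is invented here to match the description above, not copied
from the patch):

static bool
atomic_dec_uint_lock_if_zero(volatile unsigned int *ucp, kmutex_t *lock)
{
        unsigned int uc;

        for (;;) {
                uc = *ucp;
                KASSERT(0 < uc);
                if (uc == 1) {
                        mutex_enter(lock);
                        if (atomic_cas_uint(ucp, 1, 0) == 1)
                                return false;   /* last user; lock held */
                        mutex_exit(lock);       /* raced with a vref; retry */
                        continue;
                }
                if (atomic_cas_uint(ucp, uc, (uc - 1)) == uc)
                        return true;            /* not the last user */
        }
}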

   > int
   > vn_lock(struct vnode *vp, int flags)
   > {
   >    int error = 0;
   > 
   >    VOP_LOCK(vp, flags);
   > 
   >    mutex_enter(vp->v_interlock);
   >    if (__predict_false(vp->v_state == VS_REVOKING) ||
   >        (vp->v_op == dead_vnodeop_p))
   >            error = ENOENT;
   >    else
   >            KASSERT(vp->v_state == VS_ACTIVE);
   >    mutex_exit(vp->v_interlock);
   > 
   >    if (error)
   >            VOP_UNLOCK(vp);

   This doesn't work as VOP_UNLOCK() may or will use a different
   operations vector than VOP_LOCK() above.  There was a reason for
   my change to VOP_LOCK() some days ago.

Hmm.  This vn_lock code may be broken.  But revocation is currently
also broken because it doesn't exclude other vnode operations, and if
we fixed it in the obvious way so that VOP_REVOKE or vrevoke took the
vnode lock, then xyzfs and deadfs would have to use genfs_lock anyway
in order for it to all work out.  So even if we use the LK_RETRY hack
I believe there's more that we'll need to fix to enable revocation to
mix with non-genfs_lock VOP_LOCKs.

In any case, however we implement the routines, my point here was that
file systems should use vn_lock to lock their own vnodes, and check
for failure.  System calls should use vn_lock_deadok which will always
succeed.
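
Concretely, the intended split in usage looks like this (vn_lock_deadok
is the proposed name and is assumed to always succeed; examplefs_* is a
stand-in):

/* File system code locking one of its own vnodes: failure means the
 * vnode was revoked or is being destroyed, so give up on it. */
static int
examplefs_frob(struct vnode *vp)
{
        int error;

        error = vn_lock(vp, LK_EXCLUSIVE);
        if (error)
                return error;   /* look the inode up afresh if need be */
        /* ... operate on the locked vnode ... */
        VOP_UNLOCK(vp);
        return 0;
}

/* System call code working on a file descriptor: a revoked vnode is
 * fine here; the deadfs operations will return the right errors. */
static int
example_syscall_step(struct vnode *vp)
{
        vn_lock_deadok(vp, LK_EXCLUSIVE);
        /* ... VOP_*() calls ... */
        VOP_UNLOCK(vp);
        return 0;
}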

   >     * XXX Pointer writes must be atomic; otherwise concurrent
   >     * VOP_* trying to use v_op will fail horribly.

   It is not a problem with atomic writes -- vnode operations working
   on unlocked vnodes may still be running and this is horrible.

Yeah -- VOP_REVOKE obviously needs to take the vnode lock.  The atomic
write issue is that VOP_LOCK reads v_op unlocked.

   Revoking a device node must revoke all aliases.

...Pooh!

   Vrecycle() has to change -- I'm currently working on it.

What is it supposed to do?  I don't understand from the man page
description or the uses in-tree.

Also, I propose we nix the terms `recycle' and `reclaim' (except for
VOP_RECLAIM, until we rename it to VOP_DESTROY).  When we're done with
a vnode, we destroy it and free its memory.  When we want to grab it
from the inode cache, we get it, or perhaps grab it.  (s/vget/vgrab/1,
so we are not linguistically motivated to parallel vput?)

   Traversing mnt_vnodelist needs a better API -- I'm currently working on it.

Sounds good.

   Hold counts are still broken.  They were introduced to prevent the
   vnode from being recycled.  I suspect most users of vhold are more
   or less wrong.

If so, I totally misunderstood them -- but I believe that is how the
current vfs_vnode.c interprets the hold count.  How would it be
different from the usecount as you describe?

I think you're right that most of the three users are wrong: udf and
lfs probably meant to use vref.  But it's not clear to me that the one
call in uvm_page.c is wrong -- if vfs decides to destroy that vnode,
it will call VOP_PUTPAGES to force those pages to their backing store.

