Re: rototilling the vnode life cycle



On Feb 27, 2014, at 7:16 PM, Taylor R Campbell 
<campbell+netbsd-tech-kern%mumble.net@localhost> wrote:

I'm currently short on time, but here are some comments and remarks.

First, using an explicit vnode state, protected by v_interlock for
inspection and modification, looks good.

> Prompted by hannken's recent patch to tweak vnode locking, and because
> I couldn't figure out what in Cthulhu's name is going on in the mess
> that is the vnode life cycle for the umpteenth time I looked at it, I
> wrote a draft (attached) of a new vnode life cycle that sorts out all
> of the issues I could find.
> 
> I haven't (even compile-)tested it -- I'm putting it out here for
> review of the concepts and to see whether the description seems
> sensible to someone who is not in my head.  Comments?
> 
> 
> Here are some particular problems it seeks to fix:
> 
> 1. Revocation and destruction are confused.  Reclamation is the only
> thing common to both operations.  Deadfs is for revocation; there is
> no reason to use it for destruction.

I suppose you mean using it for the short time after VOP_RECLAIM()
and before vnfree().  It is still better than using the operations
vector of the file system the vnode has been removed from, or using
a NULL vector.
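
Concretely, for that window the destroy path would do something like
this (a sketch, not the literal code):

	/* after the file system has let go of its data structures */
	VOP_RECLAIM(vp);
	vp->v_op = dead_vnodeop_p;	/* stray VOP_* calls fail gracefully */
	vp->v_tag = VT_NON;
	vfs_insmntque(vp, NULL);	/* off the file system's vnode list */
	vnfree(vp);

With a NULL vector a late VOP_* call would dereference NULL instead
of returning an error.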

> 2. There are too many states that vnodes can be in, no clear way to
> kassert which ones are which, and no clear statement of what states
> are valid for what vnode subroutines.

Yes.
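
For kassertability, a helper along these lines would already go a
long way (a sketch, names invented):

#define VSTATE_ASSERT(vp, state)					\
	KASSERTMSG((vp)->v_state == (state),				\
	    "vnode %p has state %d, expected %d",			\
	    (vp), (int)(vp)->v_state, (int)(state))

Every vnode subroutine could then start by asserting the states it
accepts.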

> 3. I can't prove there aren't races in deactivation and reclamation
> because the code paths are so byzantine, and it's not clear what
> zombie cases file systems have to handle.  (See, e.g., the bizarre
> conditional call to VOP_INACTIVE just before VOP_RECLAIM in vclean.)

Bizarre?  "Always close the vnode except when it is an active block
device that is mounted" is quite reasonable -- what would you expect
from a mounted file system that lost its device?
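
Paraphrasing the logic in vclean from memory (not verbatim):

	/* vnode lock held; active means v_usecount was nonzero */
	if (active &&
	    !(vp->v_type == VBLK && vp->v_specmountpoint != NULL))
		VOP_CLOSE(vp, FNONBLOCK, NOCRED);
	/* a mounted block device stays open for its file system */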

> Instead, once we decide to destroy a vnode, it should be final, and
> file systems shouldn't have to handle races in making and effecting
> that decision.

That's the way it works now.  With the one exception cited above, a
vnode reaching vclean will be destroyed (maybe becoming a dead node
if it is active).

> /*
>  * Vnode life cycle
>  *
>  * Vnodes exist in one of six states:
>  *
>  * - UNINITIALIZED
>  * - ACTIVE
>  * - REVOKING
>  * - DEACTIVATING
>  * - INACTIVE
>  * - DESTROYING

Is there a dead state missing?

>  *
>  * To get a file for a vnode on disk, a file system will do this:
>  *
>  * 1. Look it up in inode table; if there, try vget and return.
>  * 2. If not there or if vget fails, getnewvnode and allocate data
>  *    structures needed by the file system.
>  * 3. Look it up again in inode table; if there, try vget,
>  *    ungetnewvnode, free data structures, and return.
>  * 4. If not there or if vget fails, initialize the new vnode, call
>  *    vnode_ready, and store it in the inode table.

This could be done better -- see David Holland's "fixing the vnode
lifecycle" from Sep 22, 2013.

>  *
>  * This yields an ACTIVE vnode and stores it in the inode table.
>  *
>  * When done, either vrele or vrele_async the vnode.  If you held the
>  * last reference (except for the inode table), then the vnode will
>  * become DEACTIVATING, file system's VOP_INACTIVE will be called, and
>  * finally either
>  *
>  * (a) the vnode will become INACTIVE, or
>  * (b) the vnode will become DESTROYING, the file system's VOP_RECLAIM
>  *     will be called, and at last the vnode will be freed.
>  *
>  * If you have an ACTIVE vnode, you can revoke access to it with
>  * vrevoke, usually called by VOP_REVOKE.  This causes it to enter the
>  * REVOKING state temporarily while the file system's VOP_RECLAIM is
>  * called, and then it changes identity to an ACTIVE vnode in deadfs.

Better use an explicit DEAD state here.
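
I.e. extend the list (sketch):

	enum vnode_state {
		VS_UNINITIALIZED,
		VS_ACTIVE,
		VS_REVOKING,
		VS_DEACTIVATING,
		VS_INACTIVE,
		VS_DESTROYING,
		VS_DEAD,	/* revoked; v_op is dead_vnodeop_p */
	};

Then vn_lock and friends can test for VS_DEAD instead of comparing
v_op against dead_vnodeop_p.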

>  *
>  * When a vnode is revoked, while it is temporarily REVOKING or when it
>  * is ACTIVE in deadfs, vn_lock fails.  File systems should thus use
>  * vn_lock to lock their own vnodes, and check for failure.
>  *
>  * Vnodes in the DEACTIVATING and DESTROYING states should be seen only
>  * by VOP_INACTIVE and VOP_RECLAIM, respectively.
>  *
>  * The state is preserved by v_interlock.  When a vnode is in the
>  * REVOKING, DEACTIVATING, or DESTROYING state, one thread is
>  * exclusively responsible for taking an action and transitioning the
>  * vnode to another state, or freeing it in the case of DESTROYING.
>  * State transitions from REVOKING and DEACTIVATING are signalled by
>  * the condvar v_cv.  No other state transitions are signalled.
>  *
>  * File system's responsibilities:
>  *
>  * - VOP_INACTIVE is called when a vnode is no longer referenced and is
>  * in the DEACTIVATING state.  It must tell the caller whether the
>  * vnode is still useful, in which case it will become INACTIVE, or not
>  * (e.g., because the last link on disk has been unlinked), in which
>  * case it will be reclaimed.  VOP_INACTIVE must release the vnode
>  * lock.
>  *
>  * - VOP_RECLAIM is called to release resources associated with a vnode
>  * when it is either being destroyed (DESTROYING) or revoked (REVOKING).
>  * VOP_RECLAIM must remove the vnode from all file system data
>  * structures where inactive vnodes live, so that when it returns, the
>  * vnode is no longer available for someone to try to vget it.
>  *
>  * - VOP_PUTPAGES and VOP_FSYNC may be given ACTIVE, REVOKING, or
>  * DESTROYING vnodes.
>  *
>  * - VOP_BWRITE may also be given an ACTIVE, REVOKING, or DESTROYING
>  * vnode, but the possibility of REVOKING or DESTROYING looks like an
>  * artefact of an old version of the NFS code before NFS vnodes had
>  * locks.  (See the XXX in vinvalbuf from vfs_subr.c rev. 1.441.)
>  *
>  * - All other vops take ACTIVE vnodes.
>  */
> /*
>  * Vnode allocation
>  */
> 
> /*
>  * vnalloc: Allocate a vnode.  If mp is nonnull, this is a marker vnode
>  * for it; otherwise, it is a normal vnode.  Must be freed with vnfree.
>  */
> struct vnode *
> vnalloc(struct mount *mp)
> {
>       static const struct vnode zero_vnode;
>       struct vnode *vp;
> 
>       if (mp == NULL)         /* not a marker */
>               vdrain_vnode_created();
> 
>       vp = pool_cache_get(vnode_cache, PR_WAITOK);
>       KASSERT(vp != NULL);
> 
>       *vp = zero_vnode;
>       vp->v_state = VS_UNINITIALIZED;
>       uvm_obj_init(&vp->v_uobj, &uvm_vnodeops, true, 0);
>       cv_init(&vp->v_cv, "vnode");
>       LIST_INIT(&vp->v_nclist);
>       LIST_INIT(&vp->v_dnclist);
> 
>       if (mp == NULL) {
>               rw_init(&vp->v_lock);
>       } else {
>               vp->v_mount = mp;
>               vp->v_type = VBAD;
>               vp->v_iflag = VI_MARKER;
>       }
> 
>       return vp;
> }
> 
> /*
>  * vnfree: Free a vnode allocated with vnalloc.
>  *
>  * - vp must be UNINITIALIZED or DESTROYING.
>  */
> void
> vnfree(struct vnode *vp)
> {
>       bool marker;
> 
>       KASSERT((vp->v_state == VS_UNINITIALIZED) ||
>           (vp->v_state == VS_DESTROYING));
>       KASSERT(vp->v_usecount == 0);
> 
>       marker = ISSET(vp->v_iflag, VI_MARKER);
>       if (marker) {
>               KASSERT(vp->v_type == VBAD);
>       } else {
>               rw_destroy(&vp->v_lock);
>       }
> 
>       KASSERT(LIST_EMPTY(&vp->v_dnclist));
>       KASSERT(LIST_EMPTY(&vp->v_nclist));
>       cv_destroy(&vp->v_cv);
>       uvm_obj_destroy(&vp->v_uobj, true);
> 
>       /* Make use-after-free fail noisily.  */
>       vp->v_state = VS_UNINITIALIZED;
> 
>       pool_cache_put(vnode_cache, vp);
> 
>       if (!marker)
>               vdrain_vnode_destroyed();
> }
> /*
>  * Vnode creation
>  */
> 
> /*
>  * getnewvnode: Create a new vnode, which must be destroyed either with
>  * ungetnewvnode or with vnode_ready and then vnode_destroy.
>  */
> int
> getnewvnode(enum vtagtype tag, struct mount *mp, int (**vops)(void *),
>     kmutex_t *interlock, struct vnode **vpp)
> {
>       struct vnode *vp = NULL;
>       int error;
> 
>       if (mp != NULL) {
>               error = vfs_busy(mp, NULL);
>               if (error)
>                       return error;
>       }
> 
>       vp = vnalloc(NULL);
> 
>       KASSERT(vp->v_mount == NULL);
> 
>       vp->v_type = VNON;
>       vp->v_tag = tag;
>       vp->v_op = vops;
>       vp->v_data = NULL;
>       vp->v_writecount = 0;
>       vp->v_holdcnt = 0;
> 
>       /* These should be set up by uvm_obj_init in vnalloc.  */
>       KASSERT(vp->v_usecount == 0);
>       KASSERT(vp->v_uobj.pgops == &uvm_vnodeops);
>       KASSERT(vp->v_uobj.uo_npages == 0);
>       KASSERT(TAILQ_FIRST(&vp->v_uobj.memq) == NULL);
> 
>       vp->v_size = vp->v_writesize = VSIZENOTSET;
> 
>       if (interlock) {
>               mutex_obj_hold(interlock);
>               uvm_obj_setlock(&vp->v_uobj, interlock);
>               KASSERT(vp->v_interlock == interlock);
>       }
> 
>       vfs_insmntque(vp, mp);
> 
>       if (mp != NULL) {
>               if (ISSET(mp->mnt_iflag, IMNT_MPSAFE))
>                       vp->v_vflag |= VV_MPSAFE;
>               /*
>                * XXX This should add a reference to mp, rather than
>                * using a kludgey boolean flag in vfs_unbusy.
>                */
>               vfs_unbusy(mp, true, NULL);
>       }
> 
>       *vpp = vp;
>       return 0;
> }
> 
> /*
>  * ungetnewvnode: Undo a getnewvnode before it is initialized.
>  *
>  * - vp must be UNINITIALIZED.
>  */
> void
> ungetnewvnode(struct vnode *vp)
> {
> 
>       KASSERT(vp->v_state == VS_UNINITIALIZED);
>       KASSERT(vp->v_type == VNON);
>       KASSERT(vp->v_data == NULL);
>       KASSERT(vp->v_writecount == 0);
>       KASSERT(vp->v_holdcnt == 0);
>       KASSERT(vp->v_usecount == 0);
> 
>       vfs_insmntque(vp, NULL);
>       vnfree(vp);
> }
> 
> /*
>  * vnode_ready: Make a vnode from getnewvnode ACTIVE.
>  *
>  * - vp must be UNINITIALIZED.
>  */
> void
> vnode_ready(struct vnode *vp)
> {
> 
>       /*
>        * We ought to have the sole reference to vp here, because it
>        * is newly created, but getnewvnode has put it on the file
>        * system's vnode list, so some nefarious party trying to walk
>        * that list may get at it first.
>        */
>       mutex_enter(vp->v_interlock);
> 
>       KASSERT(vp->v_state == VS_UNINITIALIZED);
>       KASSERT(vp->v_writecount == 0);
>       KASSERT(vp->v_holdcnt == 0);
>       KASSERT(vp->v_usecount == 0);
> 
>       vp->v_usecount = 1;
>       vp->v_state = VS_ACTIVE;
> 
>       mutex_exit(vp->v_interlock);
> }
> 
> /*
>  * vnode_ready_p: Return true if someone has called vnode_ready(vp),
>  * false if not.  To be used when walking struct mount::mp_vnodelist to
>  * skip entries still being initialized.
>  *
>  * - vp->v_interlock must be held.
>  */
> bool
> vnode_ready_p(struct vnode *vp)
> {
> 
>       KASSERT(mutex_owned(vp->v_interlock));
>       return (vp->v_state != VS_UNINITIALIZED);
> }
> /*
>  * Activating vnodes
>  */
> 
> /*
>  * vget: Try to activate a vnode.  If already ACTIVE, good.  If
>  * INACTIVE, make it ACTIVE.  If DEACTIVATING, wait for it to finish
>  * and retry, or return EBUSY if we can't wait.  Otherwise, fail.
>  *
>  * - vp may be in any state except UNINITIALIZED.
>  * - vp->v_interlock must be held.
>  */
> int
> vget(struct vnode *vp, int flags)
> {
>       int error;
> 
>       KASSERT(mutex_owned(vp->v_interlock));
>       KASSERT(!ISSET(vp->v_iflag, VI_MARKER));
>       KASSERT(!ISSET(flags, ~VGET_WAITOK));
> 
> retry:        switch (vp->v_state) {
>       case VS_UNINITIALIZED:
>               vnpanic(vp, "vget uninitialized vnode");

As it is on the mount list, this looks wrong.
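
A walker of mnt_vnodelist may look the vnode up before vnode_ready()
has run, so waiting seems more appropriate than panicking; a sketch
(this assumes vnode_ready() would broadcast v_cv):

	case VS_UNINITIALIZED:
		if (ISSET(flags, VGET_WAITOK)) {
			do cv_wait(&vp->v_cv, vp->v_interlock);
			while (vp->v_state == VS_UNINITIALIZED);
			goto retry;
		}
		return EBUSY;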

> 
>       case VS_ACTIVE: {
>               unsigned int uc;
> 
>               do {
>                       uc = vp->v_usecount;
>                       KASSERT(0 < uc);
>                       if (uc == UINT_MAX)
>                               return EBUSY;
>               } while (atomic_cas_uint(&vp->v_usecount, uc, (uc + 1)) != uc);

A simple atomic_add should be sufficient -- we hold the interlock
here, so the state cannot change.
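
So the VS_ACTIVE case collapses to something like:

	case VS_ACTIVE:
		if (vp->v_usecount == UINT_MAX)
			return EBUSY;
		atomic_inc_uint(&vp->v_usecount);
		return 0;

(If the overflow check has to be exact against concurrent vref, the
cas loop is still needed -- the interlock pins the state, not the
count.)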

> 
>               return 0;
>       }
> 
>       case VS_DEACTIVATING:
>               if (ISSET(flags, VGET_WAITOK)) {
>                       do cv_wait(&vp->v_cv, vp->v_interlock);
>                       while (vp->v_state == VS_DEACTIVATING);
>                       goto retry;
>               }
>               return EBUSY;
> 
>       case VS_INACTIVE:
>               KASSERT(vp->v_usecount == 0);
>               vnode_remove_from_freelist(vp);
>               vp->v_usecount = 1;
>               vp->v_state = VS_ACTIVE;
>               return 0;
> 
>       case VS_REVOKING:
>       case VS_DESTROYING:
>               return ENOENT;

Returning without waiting looks bad.

> 
>       default:
>               vnpanic(vp, "vnode in invalid state: %d", (int)vp->v_state);
>       }
> }
> /*
>  * Use counts
>  */
> 
> /*
>  * vref: Bump vp's use count.
>  *
>  * - vp must be ACTIVE or REVOKING.
>  *   (Otherwise use vget to get a reference.)
>  */
> void
> vref(struct vnode *vp)
> {
>       unsigned int usecount;
> 
> #ifdef DIAGNOSTIC
>     {
>       mutex_enter(vp->v_interlock);
>       KASSERT((vp->v_state == VS_ACTIVE) || (vp->v_state == VS_REVOKING));
>       KASSERT(0 < vp->v_usecount);
>       mutex_exit(vp->v_interlock);
>     }
> #endif
> 
>       usecount = atomic_inc_uint_nv(&vp->v_usecount);
>       if (__predict_true(2 <= usecount))
>               return;
>       if (usecount == 0)
>               vnpanic(vp, "vnode reference count overflow");
>       if (usecount == 1)
>               vnpanic(vp, "vref of unreferenced vnode");
> }
> 
> /*
>  * vrele: Drop vp's use count.  If it drops to zero, call VOP_INACTIVE
>  * and maybe VOP_RECLAIM.  May take and drop vp->v_interlock.
>  *
>  * Because this may call VOP_INACTIVE or VOP_RECLAIM synchronously, it
>  * may take vp's vnode lock, so caller must not hold any locks that are
>  * out of order with vp's vnode lock.  If this is an issue, or if
>  * calling VOP_INACTIVE or VOP_RECLAIM is otherwise an issue, use
>  * vrele_async instead.
>  *
>  * - vp must be ACTIVE or REVOKING.
>  */
> void
> vrele(struct vnode *vp)
> {
> 
>       vrele_with(vp, &vnode_deactivate);

Take v_interlock.

> }
> 
> /*
>  * vrele_async: Drop vp's use count.  If it drops to zero, schedule an
>  * asynchronous call to VOP_INACTIVE and maybe VOP_RECLAIM.  May take
>  * and drop vp->v_interlock.
>  *
>  * - vp must be ACTIVE or REVOKING.
>  */
> void
> vrele_async(struct vnode *vp)
> {
> 
>       vrele_with(vp, &vnode_deactivate_async);

Ditto.

> }
> 
> static inline void
> vrele_with(struct vnode *vp, void (*deactivate)(struct vnode *vp))
> {
> 
>       if (__predict_true(atomic_dec_uint_lock_if_zero(&vp->v_usecount,
>                   vp->v_interlock)))
>               return;
> 
>       KASSERT(mutex_owned(vp->v_interlock));
>       KASSERT((vp->v_state == VS_ACTIVE) || (vp->v_state == VS_REVOKING));
>       KASSERT(vp->v_usecount == 0);
>       vp->v_state = VS_DEACTIVATING;
> 
>       (*deactivate)(vp);
> 
>       mutex_exit(vp->v_interlock);
> }
> 
> static void
> vnode_deactivate(struct vnode *vp)
> {
>       bool destroy;
> 
>       KASSERT(mutex_owned(vp->v_interlock));
>       KASSERT(vp->v_state == VS_DEACTIVATING);
> 
>       mutex_exit(vp->v_interlock);
>       VOP_LOCK(vp, LK_EXCLUSIVE);     /* XXX This is silly.  */
>       VOP_INACTIVE(vp, &destroy);
>       mutex_enter(vp->v_interlock);
> 
>       KASSERT(vp->v_state == VS_DEACTIVATING);
>       vp->v_state = VS_INACTIVE;
>       cv_broadcast(&vp->v_cv);
> 
>       if (destroy)
>               vnode_destroy(vp);
>       else
>               vnode_add_to_freelist(vp);
> }
> 
> static void
> vnode_deactivate_async(struct vnode *vp)
> {
> 
>       KASSERT(mutex_owned(vp->v_interlock));
>       KASSERT(vp->v_state == VS_DEACTIVATING);
> 
>       /* Pretend it's active until the thread can get to it.  */
>       vp->v_state = VS_ACTIVE;
>       vp->v_usecount = 1;
>       cv_broadcast(&vp->v_cv);
> 
>       mutex_enter(&vrele_lock);
>       TAILQ_INSERT_TAIL(&vrele_list, vp, v_freelist);
>       /* XXX Why delay?  */
>       /* XXX Use a per-CPU workqueue instead?  */
>       if ((vrele_pending == UINT_MAX) ||
>           (++vrele_pending > (desiredvnodes >> 8)))
>               cv_signal(&vrele_cv);
>       mutex_exit(&vrele_lock);
> }
> 
> static void
> vrele_thread(void *arg __unused)
> {
>       struct vnode *vp;
> 
>       for (;;) {
>               mutex_enter(&vrele_lock);
>               while (TAILQ_EMPTY(&vrele_list))
>                       cv_wait(&vrele_cv, &vrele_lock);
>               vp = TAILQ_FIRST(&vrele_list);
>               TAILQ_REMOVE(&vrele_list, vp, v_freelist);
>               /* May be zero if we overflowed, but that's OK.  */
>               if (vrele_pending)
>                       vrele_pending--;
>               mutex_exit(&vrele_lock);
> 
>               vrele(vp);
>       }
> }
> /*
>  * Vnode lock
>  */
> 
> /*
>  * vn_lock: Lock vp with VOP_LOCK.  Fail and leave it unlocked if it
>  * has been revoked.  Return zero on success, error code on failure.
>  *
>  * - vp must be ACTIVE or REVOKING.
>  */
> int
> vn_lock(struct vnode *vp, int flags)
> {
>       int error = 0;
> 
>       VOP_LOCK(vp, flags);
> 
>       mutex_enter(vp->v_interlock);
>       if (__predict_false(vp->v_state == VS_REVOKING) ||
>           (vp->v_op == dead_vnodeop_p))
>               error = ENOENT;
>       else
>               KASSERT(vp->v_state == VS_ACTIVE);
>       mutex_exit(vp->v_interlock);
> 
>       if (error)
>               VOP_UNLOCK(vp);

This doesn't work, as VOP_UNLOCK() may (or will) dispatch through a
different operations vector than the VOP_LOCK() above.  There was a
reason for my change to VOP_LOCK() some days ago.
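
The way out is to let deadfs dispatch lock/unlock to the same genfs
routines operating on v_lock, so a vector change between VOP_LOCK()
and VOP_UNLOCK() becomes harmless; a sketch of the deadfs entries:

	/* in dead_vnodeop_entries (sketch): */
	{ &vop_lock_desc, genfs_lock },
	{ &vop_unlock_desc, genfs_unlock },
	{ &vop_islocked_desc, genfs_islocked },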

> 
>       return error;
> }
> 
> /*
>  * vn_lock_deadok: Lock vp with VOP_LOCK.  If it is being revoked, wait
>  * until it is revoked and then lock it as a deadfs node with VOP_LOCK.
>  *
>  * - vp must be ACTIVE or REVOKING.
>  */
> void
> vn_lock_deadok(struct vnode *vp, int flags)
> {
> 
> retry:        VOP_LOCK(vp, flags);
>       mutex_enter(vp->v_interlock);
>       if (__predict_false(vp->v_state == VS_REVOKING)) {
>               VOP_UNLOCK(vp);
>               do cv_wait(&vp->v_cv, vp->v_interlock);
>               while (vp->v_state == VS_REVOKING);
>               goto retry;
>       } else {
>               KASSERT(vp->v_state == VS_ACTIVE);
>       }
>       mutex_exit(vp->v_interlock);
> }
> 
> /*
>  * vn_unlock: Unlock a vnode that was locked with vn_lock or
>  * vn_lock_deadok.
>  *
>  * - vp's vnode lock must be held.
>  * - vp must be ACTIVE or REVOKING.
>  */
> void
> vn_unlock(struct vnode *vp)
> {
> 
>       VOP_UNLOCK(vp);
> }
> /*
>  * Revocation and reclamation
>  */
> 
> /*
>  * vrevoke: Turn vp into a dead vnode, to implement VOP_REVOKE.  Will
>  * take and drop vp->v_interlock.
>  *
>  * - vp must be ACTIVE.
>  *
>  * NOTE: If you hold the vnode lock on entry to this, either you must
>  * use the same VOP_UNLOCK as deadfs afterward or you must arrange some
>  * other way to have exclusive access.  (XXXRACE Currently genfs_revoke
>  * does NOT guarantee exclusive access!)
>  *
>  * XXX Caller currently can't hold vnode lock because vnode_reclaim
>  * takes it.  We could do unlock/reclaim/lock if we made all callers do
>  * lock/vrevoke/unlock, though.
>  *
>  * XXX The old vrevoke would replace v_op by spec_vnodeop_p in some
>  * cases.  I have not yet disentangled what those cases are.  Help?
>  */
> void
> vrevoke(struct vnode *vp)
> {
> 
> #if notyet                    /* XXX */
>       KASSERT(VOP_ISLOCKED(vp) == LK_EXCLUSIVE);
> #endif
> 
>       mutex_enter(vp->v_interlock);
>       KASSERT(vp->v_state == VS_ACTIVE);
>       KASSERT(0 < vp->v_usecount);
> 
>       vp->v_state = VS_REVOKING;
>       mutex_exit(vp->v_interlock);
> #if notyet                    /* XXX */
>       VOP_UNLOCK(vp);
> #endif
>       vnode_reclaim(vp);
> #if notyet                    /* XXX */
>       VOP_LOCK(vp, LK_EXCLUSIVE);
> #endif
>       mutex_enter(vp->v_interlock);
>       KASSERT(vp->v_state == VS_REVOKING);
> 
>       vp->v_state = VS_ACTIVE;
>       /*
>        * XXX Pointer writes must be atomic; otherwise concurrent
>        * VOP_* trying to use v_op will fail horribly.

It is not a problem of atomic writes -- vnode operations working on
unlocked vnodes may still be running, and that is horrible.

>        */
>       vp->v_op = dead_vnodeop_p;      /* XXX What about specnodes?  */
>       vp->v_tag = VT_NON;
>       KNOTE(&vp->v_klist, NOTE_REVOKE);
>       cv_broadcast(&vp->v_cv);
> 
>       KASSERT(!ISSET(vp->v_iflag, VI_ONWORKLST));
> 
>       mutex_exit(vp->v_interlock);

Revoking a device node must revoke all aliases.
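
For VBLK/VCHR that means looping over the aliases until none are
left; a sketch (the helper names are assumptions; vnode_revoke_one()
must end up in spec_node_destroy() so the loop terminates):

	if (vp->v_type == VBLK || vp->v_type == VCHR) {
		struct vnode *vq;

		while (spec_node_lookup_by_dev(vp->v_type, vp->v_rdev,
		    &vq) == 0) {
			vnode_revoke_one(vq);
			vrele(vq);
		}
	}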

> }
> 
> /*
>  * vrecycle: If vp is inactive, then take vp->v_interlock, unlock
>  * interlock if nonnull, destroy vp, and return true.  Otherwise,
>  * return false.
>  *
>  * XXX Should this wait for vnodes in changing states?  Old vrecycle
>  * didn't, but it's not clear whether that's a bug or that's what the
>  * callers actually want.
>  */
> int
> vrecycle(struct vnode *vp, kmutex_t *interlock)
> {
> 
>       mutex_enter(vp->v_interlock);
>       KASSERT(!ISSET(vp->v_iflag, VI_MARKER));
> 
> #if notyet                    /* XXX */
>       while ((vp->v_state == VS_REVOKING) ||
>           (vp->v_state == VS_DEACTIVATING))
>               cv_wait(&vp->v_cv, vp->v_interlock);
> #endif
> 
>       if (vp->v_state != VS_INACTIVE) {
>               mutex_exit(vp->v_interlock);
>               return 0;
>       }
> 
>       if (interlock)
>               mutex_exit(interlock);
>       vnode_remove_from_freelist(vp);
>       vnode_destroy(vp);
>       return 1;

Vrecycle() has to change -- I'm currently working on it.

> }
> 
> /*
>  * vnode_destroy: Do bookkeeping, reclaim vp, and free vp.  Will drop
>  * vp->v_interlock and will take and drop the vnode lock.
>  *
>  * - vp must be INACTIVE.
>  * - vp->v_interlock must be held.
>  *
>  * VOP_RECLAIM guarantees we have the sole reference to vp when
>  * vnode_reclaim returns, even after we drop vp->v_interlock.
>  */
> static void
> vnode_destroy(struct vnode *vp)
> {
>       int error __diagused;
> 
>       KASSERT(mutex_owned(vp->v_interlock));
>       KASSERT(vp->v_state == VS_INACTIVE);
> 
>       vp->v_state = VS_DESTROYING;
> 
>       /* XXX Begin crud cargo-culted from old vclean.  */
>       if (ISSET(vp->v_iflag, VI_EXECMAP)) {
>               atomic_add_int(&uvmexp.execpages, -vp->v_uobj.uo_npages);
>               atomic_add_int(&uvmexp.filepages, vp->v_uobj.uo_npages);
>       }
>       /* XXX Old vrelel cleared VI_WRMAP; old vclean didn't.  Hmm.  */
>       vp->v_iflag &= ~(VI_TEXT | VI_EXECMAP | VI_WRMAP);
>       /* XXX End crud cargo-culted from old vclean.  */
> 
>       mutex_exit(vp->v_interlock);
> 
>       /* Nuke the file system's data structures.  */
>       vnode_reclaim(vp);
> 
>       /*
>        * XXX How do we ensure nobody grabs onto a reference from
>        * mnt_vnodelist before we've taken it off here?  Can we do
>        * this before vnode_reclaim?
>        */

Traversing mnt_vnodelist needs a better API -- I'm currently working on it.
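
Something marker-based, for example (names hypothetical):

	struct vnode_iterator *marker;
	struct vnode *vp;

	vfs_vnode_iterator_init(mp, &marker);
	while ((vp = vfs_vnode_iterator_next(marker)) != NULL) {
		/* vp comes back referenced and cannot be destroyed */
		vrele(vp);
	}
	vfs_vnode_iterator_destroy(marker);

The iterator would skip vnodes where vnode_ready_p() is false and
take a reference under v_interlock before returning them.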

>       vfs_insmntque(vp, NULL);
>       vnfree(vp);
> }
> 
> /*
>  * vnode_reclaim: Flush buffers associated with vp, call VOP_RECLAIM,
>  * and purge caches related to vp in preparation for revoking or
>  * destroying it.
>  *
>  * - vp must be DESTROYING or REVOKING.
>  *
>  * XXX It would be nice if it were not necessary to take the vnode lock
>  * here, because then we could do reclamation synchronously in
>  * getnewvnode.
>  */
> static void
> vnode_reclaim(struct vnode *vp)
> {
>       int error;
> 
> #ifdef DIAGNOSTIC
>     {
>       mutex_enter(vp->v_interlock);
>       KASSERT((vp->v_state == VS_DESTROYING) || (vp->v_state == VS_REVOKING));
>       mutex_exit(vp->v_interlock);
>     }
> #endif
> 
>       /* XXX Begin crud cargo-culted from old vclean.  */
>       VOP_LOCK(vp, LK_EXCLUSIVE);
>       /* XXX XXX Can vinvalbuf reactivate vp?  That would suck.  */
>       error = vinvalbuf(vp, V_SAVE, NOCRED, 1, 0, 0);
>       if (error) {
>               if (wapbl_vphaswapbl(vp))
>                       WAPBL_DISCARD(wapbl_vptomp(vp));
>               error = vinvalbuf(vp, 0, NOCRED, 1, 0, 0);
>               KASSERT(error == 0);
>       }
>       KASSERT(!ISSET(vp->v_iflag, VI_ONWORKLST));
>       VOP_UNLOCK(vp);
>       /* XXX End crud cargo-culted from old vclean.  */
> 
>       VOP_RECLAIM(vp);
>       KASSERT(vp->v_data == NULL);
>       KASSERT(vp->v_uobj.uo_npages == 0);
> 
>       /* XXX Begin crud cargo-culted from old vclean.  */
>       if ((vp->v_type == VREG) && (vp->v_ractx != NULL)) {
>               uvm_ra_freectx(vp->v_ractx);
>               vp->v_ractx = NULL;
>       }
>       cache_purge(vp);
>       /* XXX End crud cargo-culted from old vclean.  */
> }
> /*
>  * Hold counts.  When there are buffers in the kernel (buffer cache or
>  * uvm) for a vnode, we would prefer to destroy that vnode later.  The
>  * hold count records how many such buffers there are.
>  */

Hold counts are still broken.  They were introduced to prevent the
vnode from being recycled.  I suspect most users of vhold are more
or less wrong.

> 
> /*
>  * vholdl: Bump vp's hold count.
>  *
>  * - vp must be INACTIVE, ACTIVE, or REVOKING.
>  * - vp->v_interlock must be held.
>  * - vp must not be a marker vnode.
>  */
> void
> vholdl(struct vnode *vp)
> {
> 
>       KASSERT(mutex_owned(vp->v_interlock));
>       KASSERT(!ISSET(vp->v_iflag, VI_MARKER));
>       KASSERT((vp->v_state == VS_INACTIVE) ||
>           (vp->v_state == VS_ACTIVE) ||
>           (vp->v_state == VS_REVOKING));
> 
>       if (vp->v_holdcnt == UINT_MAX)
>               vnpanic(vp, "vnode hold count overflow");
>       vp->v_holdcnt++;
> 
>       if ((vp->v_state == VS_INACTIVE) && (vp->v_holdcnt == 1)) {
>               /*
>                * If it was inactive and unheld, it is on the free
>                * list and should now be on the hold list, so move it.
>                */
>               mutex_enter(&vnode_free_list_lock);
>               KASSERT(vp->v_freelisthd == &vnode_free_list);
>               TAILQ_REMOVE(vp->v_freelisthd, vp, v_freelist);
>               vp->v_freelisthd = &vnode_hold_list;
>               TAILQ_INSERT_TAIL(vp->v_freelisthd, vp, v_freelist);
>               mutex_exit(&vnode_free_list_lock);
>       }
> }
> 
> /*
>  * holdrelel: Drop vp's hold count.
>  *
>  * - vp must be INACTIVE, ACTIVE, or REVOKING.
>  * - vp->v_interlock must be held.
>  * - vp must not be a marker vnode.
>  */
> void
> holdrelel(struct vnode *vp)
> {
> 
>       KASSERT(mutex_owned(vp->v_interlock));
>       KASSERT((vp->v_state == VS_INACTIVE) ||
>           (vp->v_state == VS_ACTIVE) ||
>           (vp->v_state == VS_REVOKING));
>       KASSERT(!ISSET(vp->v_iflag, VI_MARKER));
> 
>       KASSERT(0 < vp->v_holdcnt);
>       vp->v_holdcnt--;
> 
>       if ((vp->v_state == VS_INACTIVE) && (vp->v_holdcnt == 0)) {
>               /*
>                * If it was inactive and this was the last hold, it
>                * was on the hold list and should now be on the free
>                * list, so move it.
>                */
>               mutex_enter(&vnode_free_list_lock);
>               KASSERT(vp->v_freelisthd == &vnode_hold_list);
>               TAILQ_REMOVE(vp->v_freelisthd, vp, v_freelist);
>               vp->v_freelisthd = &vnode_free_list;
>               TAILQ_INSERT_TAIL(vp->v_freelisthd, vp, v_freelist);
>               mutex_exit(&vnode_free_list_lock);
>       }
> }
> /*
>  * Freelists.  Vnodes that are not actively being used stay cached in
>  * case someone wants them soon, but get queued up to be destroyed when
>  * the number of vnodes in the system gets too high.  Vnodes not used
>  * by buffers in the kernel are on the /free list/, and get destroyed
>  * first; vnodes used by buffers in the kernel are on the /hold list/,
>  * and get destroyed after everything in the free list.

The current documentation is different -- see above.

>  *
>  * Reclamation happens asynchronously, in the vdrain thread.  Each file
>  * system's VOP_RECLAIM cannot allocate or otherwise wait for resources
>  * that allocating vnodes in any file system may require.  (Yikes!)
>  */
> 
> /*
>  * vnode_add_to_freelist: Add vp to the free list or hold list as
>  * appropriate.
>  *
>  * - vp must be INACTIVE.
>  * - vp->v_interlock must be held.
>  * - vp must not be a marker vnode.
>  */
> static void
> vnode_add_to_freelist(struct vnode *vp)
> {
> 
>       KASSERT(mutex_owned(vp->v_interlock));
>       KASSERT(vp->v_state == VS_INACTIVE);
>       KASSERT(!ISSET(vp->v_iflag, VI_MARKER));
> 
>       mutex_enter(&vnode_free_list_lock);
>       vp->v_freelisthd = (0 == vp->v_holdcnt?
>           &vnode_free_list : &vnode_hold_list);
>       TAILQ_INSERT_TAIL(vp->v_freelisthd, vp, v_freelist);
>       mutex_exit(&vnode_free_list_lock);
> }
> 
> /*
>  * vnode_remove_from_freelist: Remove vp from whichever freelist it is
>  * on.
>  *
>  * - vp must be INACTIVE.
>  * - vp->v_interlock must be held.
>  * - vp must not be a marker vnode.
>  */
> static void
> vnode_remove_from_freelist(struct vnode *vp)
> {
> 
>       KASSERT(mutex_owned(vp->v_interlock));
>       KASSERT(vp->v_state == VS_INACTIVE);
>       KASSERT(!ISSET(vp->v_iflag, VI_MARKER));
> 
>       mutex_enter(&vnode_free_list_lock);
>       KASSERT(vp->v_freelisthd == (0 == vp->v_holdcnt?
>               &vnode_free_list : &vnode_hold_list));
>       TAILQ_REMOVE(vp->v_freelisthd, vp, v_freelist);
>       vp->v_freelisthd = NULL;
>       mutex_exit(&vnode_free_list_lock);
> }
> 
> static void
> vdrain_vnode_created(void)
> {
> 
>       mutex_enter(&vnode_free_list_lock);
>       if (numvnodes == UINT_MAX)
>               panic("too many vnodes");
>       numvnodes++;
>       if ((desiredvnodes + (desiredvnodes/10)) < numvnodes)
>               cv_signal(&vdrain_cv);
>       mutex_exit(&vnode_free_list_lock);
> }
> 
> static void
> vdrain_vnode_destroyed(void)
> {
> 
>       mutex_enter(&vnode_free_list_lock);
>       numvnodes--;
>       mutex_exit(&vnode_free_list_lock);
> }
> 
> static void
> vdrain_thread(void *arg __unused)
> {
> 
>       for (;;) {
>               mutex_enter(&vnode_free_list_lock);
>               while (numvnodes < desiredvnodes)
>                       cv_wait(&vdrain_cv, &vnode_free_list_lock);
>               if (vdrain_1() == EBUSY)
>                       kpause("vdrain", false, hz, NULL);
>               /* vdrain_1 drops vnode_free_list_lock for us.  */
>       }
> }
> 
> static int
> vdrain_1(void)
> {
>       static struct vnodelst *freelists[] = {
>               &vnode_free_list, &vnode_hold_list,
>       };
>       size_t i;
>       struct vnode *vp;
>       struct mount *mp;
>       int error = ENOENT;
> 
>       KASSERT(mutex_owned(&vnode_free_list_lock));
> 
>       for (i = 0; i < __arraycount(freelists); i++) {
>               if (TAILQ_EMPTY(freelists[i]))
>                       continue;
>               TAILQ_FOREACH(vp, freelists[i], v_freelist) {
>                       /*
>                        * XXX Lock order reversal!  We can get rid of
>                        * this by removing vp from the queue before
>                        * taking its interlock and putting it back on
>                        * the queue if the fstrans can't start.
>                        * However, that also requires changing
>                        * everything else that manages vnodes on the
>                        * freelists to handle the case that vdrain may
>                        * have taken the vnode off the freelist and
>                        * may be about to put it back on.  That is
>                        * more trouble than it is worth to avoid a
>                        * single speculative grab of this vnode.
>                        */
>                       if (!mutex_tryenter(vp->v_interlock)) {
>                               error = EBUSY;
>                               continue;
>                       }
>                       KASSERT(vp->v_state == VS_INACTIVE);
>                       KASSERT(vp->v_usecount == 0);
>                       KASSERT(vp->v_freelisthd == freelists[i]);
>                       mp = vp->v_mount;
>                       if (fstrans_start_nowait(mp, FSTRANS_SHARED) != 0) {
>                               mutex_exit(vp->v_interlock);
>                               error = EBUSY;
>                               continue;
>                       }
>                       goto found;
>               }
>       }
> 
>       mutex_exit(&vnode_free_list_lock);
>       return error;
> 
> found:        TAILQ_REMOVE(freelists[i], vp, v_freelist);
>       vp->v_freelisthd = NULL;
>       mutex_exit(&vnode_free_list_lock);
> 
>       vnode_destroy(vp);
>       fstrans_done(mp);
>       return 0;
> }


--
J. Hannken-Illjes - hannken%eis.cs.tu-bs.de@localhost - TU Braunschweig 
(Germany)


