fixing the vnode lifecycle



jakllsch@ and I just spent quite some time (so far unsuccessfully)
trying to figure out a hack to keep afs from deadlocking in vget().
As a consequence of this I've been looking through a bunch of the
vnode lifecycle code and my irritation level has gone past some
critical threshold.

So, here's some thinking out loud on the subject of fixing it.

 - It's clear that the vnode lifecycle code needs to be pretty much
ripped out and reworked from the ground up. It is full of races that
have been papered over instead of fixed, and then papered over again
and again when the original hacks proved insufficient.

 - To my intense disgust I discovered recently that even though
there's ostensibly a vnode cache, and the vfs-level code keeps track
of all vnodes, there is no lookup functionality provided so every file
system is also obliged to maintain its own private vnode table.

 - While a casual inspection of the code might lead you to believe
that vnode objects can be reused for different files, this is not in
fact the case. They pass through the allocator. It is possible for the
same vnode to be 'reused' for a new file using the same inode number
on the same filesystem... sometimes.

 - In addition to (or, more accurately, as a consequence of) being
full of races, the vnode lifecycle exposes a bunch of invalid vnode
states that should not need to exist. It's one of these that afs is
tripping on.

So:

First, obviously the vfs-level vnode cache code should provide vnode
lookup so file systems don't need to maintain their own vnode
tables. Killing off the fs-level vnode tables not only simplifies the
world but also avoids exposing a number of states and operations
required only to keep the fs-level table in sync with everything else.
This will help a great deal to get rid of the bodgy locking and all
the races.

(If any file system does anything fancy with its vnode table, other
than iterating it and doing lookups by inode number, I'd like to find
out.)

Second, properly speaking the only states a live vnode can be in are
"active" and "inactive", meaning currently in use by the file system
(or by the VM system via mmap), or not. Non-live vnodes (either
freshly allocated and uninitialized, or about to be freed) should not
be visible outside vfs_vnode.c, and, for correctness, must not be
associated with any file system or inode number as far as vnode lookup
is concerned.

(I'm going to use "attached" to mean "has a specific file system and
inode number associated with it" and "detached" for the opposite.)

There's a basic problem with this picture, which is that vnodes have
per-fs data, which includes both purely in-memory structures and also
on-disk material that needs to be loaded. And therefore, a third vnode
state (where this data is not yet, or not any more, present) has to be
exposed, at least to the fs.

Proposition: this state should be exposed as little as possible, and
the vfs-level vnode cache should take responsibility (as much as
possible) for making sure it is only exposed once and never
concurrently to more than one process.
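
In other words, what I have in mind looks something like this (purely
illustrative; none of these names exist):

    /*
     * A vnode the cache will hand out is always attached (bound to a
     * specific fs and inode number) and is in one of these states.
     * Detached vnodes (freshly allocated, or about to be freed) never
     * show up outside vfs_vnode.c at all.
     */
    enum vnstate {
            VN_LOADING,     /* attached; fs data not present (yet, or
                               not anymore); visible only to the one
                               caller responsible for loading it */
            VN_ACTIVE,      /* in use by the fs, or by uvm via mmap */
            VN_INACTIVE     /* no active references, but still cached */
    };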

The next issue is whether a new vnode should be initialized before or
after it's attached to the cache, and similarly whether a dying vnode
should be cleaned up before or after it's detached from the cache. One
can argue this either way.

In some ways it's tidiest for the fs to cons up vnodes, hand them to
the cache when they're ready to use, and get them back to clean up
once the cache thinks they're over; that is, attached vnodes are
always initialized. The problem with this model is that it exposes a
race: if two processes try to load the same inode, they'll both cons
up vnodes, and then one will fail to insert its result into the cache
and have to destroy it again, possibly at some expense. It is also
possible (depending on the fs) that trying to load the same inode from
more than one process at a time might create invalid state in the
buffer cache (leading to crashes/panics) or even deadlock. Also, if
cleaning up a vnode involves writing anything to disk (as opposed to
just deallocating memory), very bad things will happen if this
overlaps other processes attempting to reload the same inode. This can
be fixed by guaranteeing that the inode will be synced before it's
detached, or by only detaching synced inodes.

The model where vnodes are attached first and then loaded, however,
creates a locking complication, because something (however small) must
be left locked in the vnode cache while the vnode is loaded in order
to avoid returning an unloaded vnode indiscriminately. Holding vfs-
level locks while calling into the fs is not necessarily a problem;
but it can be if one isn't careful.

A worse locking complication arises, however, because it's
fundamentally unsafe to lock a VOP_LOOKUP result; doing so, or at
least doing so unconditionally, creates potential deadlock cases in
rename. This means that looking up a vnode cannot lock the vnode.
(Currently it does, but this is a problem.)

Also, any model where the fs calls into the vnode cache to do vnode
lookup and the vnode cache then calls back into the fs to load a vnode
is likely to have deadlock problems.

On reflection I think the best model is:
  - attach before initialization
  - detach before cleanup
as follows:

1. In lookup, the file system calls something to retrieve a vnode by
fs and inode number. (This supersedes vget(). Let's call it
vncache_lookup() for now.)

2. This returns an attached vnode that might be not-yet-loaded. If it
is not-yet-loaded, the fs calls something internal to load it. (This
something supersedes the rest of the vget()/VFS_VGET() mechanism.)
When it's done loading, it calls the vnode cache to say so. (Let's
call this vncache_markloaded().) The vnode cache guarantees that any
other processes trying to get the vnode for the same fs and inode
number block until vncache_markloaded() is called.

3. Now the fs has an attached, loaded, and active vnode and can return
it to namei. (A sketch of how steps 1-3 might look from the fs side
appears after this list.)

4. vncache_lookup() never returns a locked vnode. In the case of a
vnode that hasn't been loaded yet, the fs is guaranteed to have the
only reference to it until vncache_markloaded() is called, so it can
load the inode without needing to lock the vnode.

5. (Currently the fs's lookup routine needs to lock the vnode before
returning it to namei because that's the protocol, but we want to fix
that in the future.)

6. When the last active reference is dropped, VOP_INACTIVE is called,
much like it currently is, except that all the crap like VI_INACTNOW
needs to go away. I see no reason that the vnode cache shouldn't just
lock the vnode with the normal vnode lock while calling VOP_INACTIVE.
For now this may require making VOP_INACTIVE calls from a separate
worker thread, but I don't see that this is a problem.
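
To make that concrete, here's roughly what steps 1-3 might look like
from the fs side. This is only a sketch: vncache_lookup() and
vncache_markloaded() are the provisional names from above, while the
examplefs_* names, the isnew out-flag, and vncache_abortload() are
invented here purely for illustration.

    int
    examplefs_vget(struct mount *mp, ino_t ino, struct vnode **vpp)
    {
            struct vnode *vp;
            bool isnew;
            int error;

            /* 1. Ask the vnode cache. This never returns a locked vnode. */
            error = vncache_lookup(mp, ino, &vp, &isnew);
            if (error)
                    return error;

            if (isnew) {
                    /*
                     * 2. We got an attached but not-yet-loaded vnode,
                     * and we hold the only reference to it until
                     * vncache_markloaded(), so the inode can be read
                     * in without locking the vnode.
                     */
                    error = examplefs_loadinode(mp, ino, vp);
                    if (error) {
                            /*
                             * Some way to hand a failed load back to
                             * the cache is needed; the name is made up.
                             */
                            vncache_abortload(vp);
                            return error;
                    }
                    vncache_markloaded(vp);
            }

            /* 3. Attached, loaded, and active; return it to namei. */
            *vpp = vp;
            return 0;
    }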

Further points:

 - The same mechanism used to protect vnode loading can be used to
make sure new references do not appear until VOP_INACTIVE completes.
I'm not sure if this is actually necessary or not -- if VOP_INACTIVE
is called with the vnode locked it shouldn't really matter.

 - At some point after VOP_INACTIVE the vnode cache will call what
I'll call VOP_ISYNC to make sure everything that needs to be written
out gets written out. VOP_ISYNC is probably not the right name for
this.

 - When the vnode cache decides it's time for the vnode to go away
entirely, it detaches the vnode from the cache and then calls
VOP_DESTROY. Logic in the current VOP_RECLAIM gets split between
VOP_ISYNC and VOP_DESTROY.

 - The contract with VOP_DESTROY is that anything that isn't safe to
do after the vnode is detached must be done in VOP_ISYNC, specifically
all I/O and updates to fs data structures. VOP_DESTROY should really
only be freeing memory. However, VOP_ISYNC isn't supposed to do
anything that would prevent the vnode from becoming active again
later. (That way it doesn't need magic locking. There's a sketch of
the intended ordering after these points.)

 - As currently, there should be a short path from VOP_INACTIVE
directly to VOP_DESTROY for inodes that have been deleted.

 - It might be that an examination of VOP_RECLAIM implementations will
show that this is unworkable, in which case a different scheme will be
needed: lock out further references, call VOP_DESTROY, and *then*
detach the vnode from the cache, symmetric with the load mechanism.

 - The mechanism I have in mind for locking out references is to stick
a lock in the hash buckets in the vnode cache: when looking up a vnode
you find the hash chain, then the bucket, then get that lock; once you
have it you can incref the vnode. For newly created vnodes, you malloc
an empty vnode and bucket, lock the bucket, insert things, then leave
the bucket locked until vncache_markloaded() is called. (This is also
sketched below.)

 - If a fs goes off into the weeds and never calls
vncache_markloaded(), it's a problem, but not really a different
problem from any other fatal kernel bug.

 - Secondary references (what we currently call "hold" references) to
a vnode do not affect its externally visible state, but they prevent
the vnode cache from calling VOP_DESTROY. (They should not, I think,
prevent it from calling VOP_ISYNC: if there's a buffer to write out,
the hold reference it takes would keep VOP_ISYNC from ever happening,
and if there isn't a buffer to write out there probably isn't anything
for VOP_ISYNC to do at all.)

 - Currently the namecache holds vnodes without taking references of
any kind to them, and the vnode lifecycle code explicitly flushes the
namecache at necessary points. I see no reason to change this in the
short term. In the long term it should take secondary ("hold")
references as a matter of tidiness. However, it needs a way to take
one of its shady references and produce a real one. Currently it uses
vtryget(). It looks to me like vtryget() should be eliminated and
replaced with a special-purpose function, and I don't see any reason
it should continue to need to use trylocks.

 - VI_XLOCK and its even uglier sibling VC_XLOCK can go away; they
arise because detaching a vnode from the cache ("cleaning") is not
currently atomic.

 - It should also be possible to kill off VI_INACTREDO. From what I
can tell, it's meant to handle the case where additional references
are taken and dropped again during the call to VOP_INACTIVE. Holding
the vnode lock during VOP_INACTIVE should make it impossible to do
anything during that time that requires an additional VOP_INACTIVE;
but failing that the mechanism for locking out additional references
should make all this unnecessary.

 - VI_INACTNOW seems to be mostly a bandaid for another set of race
conditions; however it's also used to enable lockfree vrele(). I
question the value of this micro-optimization... but I think it may be
possible to keep it, since it appears that the only reason lockfree
vrele() needs to check VI_INACTNOW is the existence of VI_INACTREDO.

 - v_interlock vs. v_lock: other than as described already I see no
need to change this. (But as I keep saying, can we please come up with
a better name than "v_interlock"?)

 - All accesses to v_usecount can be via atomic ops. Or at least, all
vfs accesses. It is an alias for a uvm structure member (uo_refs) and
I'm not sure if uvm does anything else on its own; it looks like not.
The VC_XLOCK scheme can go away. (And it doesn't currently appear to
be consistently or completely implemented.)
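
To pull the VOP_ISYNC / VOP_DESTROY points together, here's the
cache-side ordering I have in mind, as a sketch. None of this is
existing code: the vncache_* helpers are placeholders, the VOP calls
are shown with simplified argument lists, and the recycle flag is
borrowed from the current VOP_INACTIVE to show the deleted-inode
short path.

    static void vncache_destroy(struct vnode *vp);

    /* called when the last active reference is dropped */
    static void
    vncache_inactive(struct vnode *vp)
    {
            bool recycle = false;

            vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
            VOP_INACTIVE(vp, &recycle);     /* fs-level "not in use now" */
            VOP_UNLOCK(vp);

            if (recycle) {
                    /* deleted inode: short path straight to destruction */
                    vncache_destroy(vp);
                    return;
            }
            /*
             * Otherwise the vnode stays attached and cached, and may
             * become active again later.
             */
    }

    /* called when the cache decides the vnode should go away entirely */
    static void
    vncache_destroy(struct vnode *vp)
    {
            VOP_ISYNC(vp);          /* all I/O and fs updates happen here */

            if (vp->v_holdcnt > 0) {
                    /*
                     * Hold references block destruction, but not the
                     * VOP_ISYNC above; the cache just tries again
                     * later. (Interlock handling omitted.)
                     */
                    return;
            }

            vncache_detach(vp);     /* no longer findable by lookup */
            VOP_DESTROY(vp);        /* should only be freeing memory */
    }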
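
And here's the bucket locking in sketch form, since it's easier to
show than to describe. Again, all the names are invented and the entry
lifetime / memory management is hand-waved; the point is just that a
new entry goes into the table already locked and stays locked until
vncache_markloaded(), so other lookups of the same inode block until
loading finishes instead of seeing a half-loaded vnode.

    struct vncache_entry {
            LIST_ENTRY(vncache_entry) ve_chain;  /* hash chain linkage */
            kmutex_t        ve_lock;        /* the per-bucket lock */
            struct mount    *ve_mount;
            ino_t           ve_ino;
            struct vnode    *ve_vnode;
    };

    int
    vncache_lookup(struct mount *mp, ino_t ino, struct vnode **vpp,
        bool *newp)
    {
            kmutex_t *chainlock = vncache_hashlock(mp, ino);
            struct vncache_entry *ve;

            mutex_enter(chainlock);
            ve = vncache_hashfind(mp, ino);     /* walk the hash chain */
            if (ve == NULL) {
                    /*
                     * Allocate an empty vnode plus entry (the vnode
                     * comes back already referenced for the caller)
                     * and insert the entry locked.
                     */
                    ve = vncache_newentry(mp, ino);
                    mutex_enter(&ve->ve_lock);
                    vncache_hashinsert(ve);
                    mutex_exit(chainlock);
                    *vpp = ve->ve_vnode;
                    *newp = true;   /* caller must load it, then call
                                       vncache_markloaded() */
                    return 0;
            }
            mutex_exit(chainlock);

            /*
             * Blocks here while someone else is still loading this
             * inode. (Keeping the entry alive across this window is
             * one of the details being hand-waved.)
             */
            mutex_enter(&ve->ve_lock);
            vref(ve->ve_vnode);     /* now it's safe to take a reference */
            mutex_exit(&ve->ve_lock);

            *vpp = ve->ve_vnode;
            *newp = false;
            return 0;
    }

    void
    vncache_markloaded(struct vnode *vp)
    {
            struct vncache_entry *ve = vncache_entryof(vp);

            /* dropping the lock held since insertion wakes any waiters */
            mutex_exit(&ve->ve_lock);
    }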


So far this is all blather and no action, and I don't know when I'll
have time to work on it (if at all) but formulating even a halfassed
plan is always a good first step.

(I also have no real idea yet how to get to where I'm describing from
where we are in a decently incremental fashion.)

-- 
David A. Holland
dholland%netbsd.org@localhost

