Cleaning up namei

To: tech-kern%netbsd.org@localhost
Subject: Cleaning up namei
From: David Holland <dholland-tech%netbsd.org@localhost>
Date: Wed, 19 Mar 2008 05:37:09 +0000
As some of you may know or have heard, I have been loitering with
intent near the namei code lately. It is a horrible mess, it is
causing fairly serious problems by being a mess, and I want it to
stop.


What I'm intending to do/where I'm going
----------------------------------------

My basic intent is to abolish struct nameidata and struct
componentname entirely. Right now, the contents of these structures
function more or less like global variables; this is the primary
reason all of the namei-related code is so dodgy.

I am also going to be pushing work up out of namei to the callers;
another major problem is that namei tries to do everything for
everybody, and so pieces of the logic of various operations are
embedded in it when they really ought to be separate. This may lead to
some code duplication in the short run. (In the long run I'd like to
create a family of vfs_* and vn_* functions that provide all the file
ops, thus drawing a distinction between the vfs code and the system
call layer; this way compat code and nfsd and other things can sit on
top of vfs much more cleanly. But this isn't happening right away.)

The basic model I have in mind is that there will be two main calls:
namei() and namei_parent(). The former will go path -> vnode; the
latter will go path -> directory vnode and last name component, which
is the basic building block for open/creat, mkdir, rmdir, link,
unlink, rename, and so forth. There will also be sub-operations
namei_begin(), namei_once(), and namei_follow().

I am also intending to introduce a struct pathbuf, which is an
abstraction to hold a pathname - the primary value of this is that
because it is known to always be size PATH_MAX the namei code can
reuse it to follow symlinks, which turns out to simplify allocation a
good deal. But it also makes the copyin() code for pathnames more
stylized, and thus easier to both crosscheck and, whenever we're ready
to do so, generate automatically.

(I discovered the value of struct pathbuf accidentally while hacking a
research kernel; I was tinkering with namespaces and wanted to
abstract away the structure of a filename so it could be a string or a
namespace id and a string; the research proved a dead end, but
refitting my vfs code to use struct pathbuf in order to support it
turned out to make it a good bit cleaner. So I think it's a good idea
in general.)

The other major change I intend is that I'm going to simplify the
locking. While namei will hold the lock of a directory while looking
in it (which is necessary), it will not hold more than one lock at a
time and it will never (with one minor exception I'll discuss later)
return a locked vnode.

This specifically abolishes the traditional lock coupling that we (and
~everyone else) uses while searching down the tree. There is no need
for it: all it prevents is, if two processes are following the same
path, one catching up to and passing the other on the way down.  In
general this is harmless, because lookup is a read operation and read
operations commute freely with one another.

It is not quite as clear that it's harmless if it happens at the
bottom of a path, when one or both of the operations in progress are
directory writes of some kind (e.g., mkdir and rename) - but in this
world these cases all do namei_parent() and then explicitly lock the
parent directory until they're done, so the following operation has to
wait until the first completes.

So far the only case I can think of that even might pose a problem is
that it becomes possible to stat() a pathname, succeed, and get back a
struct stat that says the link count is 0. I am not convinced that
this is a serious problem, but if it is, we can easily hack around
it. (However, I have discovered some other races that the lock
coupling doesn't prevent; e.g. you can get formally inconsistent
results out of link() if the timing is just right. And of course
exact consistency is a lost cause for paths that involve "..".)

(I have been thinking about preparing a formal model and a proof that
the lock coupling isn't necessary, because the supposed need for it is
a long-standing piece of kernel lore and therefore lots of people are
skeptical. But so far I haven't, partly because it's work and partly
because the existence of other races like the one in link will make
proving anything useful difficult.)

Anyhow. I have written and tested this VFS design in a research kernel
and to the best of my knowledge it handles all the basic cases
correctly, including things like create-follows-symlinks. It is not
yet clear exactly how much extra widgetry will be needed to support
things my research kernel didn't have, like TRYEMULROOT and
MOUNT_UNION, whiteouts, and compat code that does odd things.

However, here's what the basic parts look like in pseudocode without
error handling. Note that pathbuf is stateful and also holds the flags
that pertain to the interpretation of its path. There's a slightly
different formulation where it also holds dir_vp and obj_vp, which
makes it a lot more like nameidata - even in that case it's still
a sealed abstraction, but I think I like it this way better.

   namei(pathbuf) {
        dir_vp = NULL;
        while (1) {
                dir_vp = namei_parent(dir_vp, pathbuf);
                obj_vp = namei_once(dir_vp, pathbuf);
                if (is_link(obj_vp) && !is_nofollow(pathbuf)) {
                        dir_vp = namei_follow(pathbuf, dir_vp, obj_vp);
                        continue;
                }
                break;
        }
        return obj_vp;
   }

   namei_parent(dir_vp, pathbuf) {
        dir_vp = namei_begin(dir_vp, pathbuf);
        while (1) {
                pathbuf_parsename(pathbuf);
                if (pathbuf_at_last_name(pathbuf)) {
                        break;
                }
                obj_vp = namei_once(dir_vp, pathbuf);
                if (is_link(obj_vp)) {
                        dir_vp = namei_follow(pathbuf, dir_vp, obj_vp);
                        vn_decref(obj_vp); /* ? */
                } else {
                        pathbuf_consume_name(pathbuf);
                        vn_decref(dir_vp);
                        dir_vp = obj_vp;
                }
        }
        return dir_vp;
   }

   namei_once(dir_vp, pathbuf) {
        if (!is_dir(dir_vp)) {
                ENOTDIR;
        }
        vn_lock(dir_vp);
        check_search_permission(dir_vp);
        obj_vp = VOP_LOOKUP(dir_vp, pathbuf_get_current_name(pathbuf));
        vn_unlock(dir_vp);
        return obj_vp;
   }

   namei_begin(dir_vp, pathbuf) {
        if (dir_vp == NULL) {
                if (pathbuf_use_root(pathbuf)) {
                        dir_vp = root_vp;
                } else {
                        dir_vp = curproc->p_cwd; /* or whatever */
                }
                vn_incref(dir_vp);
        }
        return dir_vp;
   }

   namei_follow(pathbuf, dir_vp, obj_vp) {
        if (pathbuf_note_followed(pathbuf)) {
                ELOOP;
        }
        /*
         * Any remaining path in pathbuf gets shuffled to the end.
         * The link contents are read into the beginning. Then the
         * part at the end is moved backwards over any leftover space
         * in between. If the link doesn't fit, we ENAMETOOLONG. This
         * avoids ever having to allocate during namei. (It won't be
         * quite so easy with TRYEMULROOT, but it's still far simpler
         * than the status quo.)
         */
        (buf, len) = pathbuf_get_symlink_space(pathbuf);
        len = VOP_READLINK(obj_vp, buf, len);
        pathbuf_accept_symlink(pathbuf, len);
        vn_decref(obj_vp);
        if (pathbuf_use_root(pathbuf)) {
                vn_decref(dir_vp);
                vn_incref(root_vp);
                return root_vp;
        }
        return dir_vp;
   }

vn_incref and vn_decref are basically vget and vrele, in case that
wasn't clear. Likewise, vn_unlock does the obvious thing.

Some points to note:

 - There's one complication that I've left out of the pseudocode:
namei_once needs a flag that tells it to leave dir_vp locked if
VOP_LOOKUP gives ENOENT. This is used by open to make O_EXCL work.
This is the only major wart in the interfaces.

 - The way this pseudocode is written, namei_parent is passed a
starting dir so namei can call it in a loop. The intended public
interface for code outside of the namei implementation won't allow
that.

 - This code does not hold the lock on dir_vp across reading a
symlink. Consequently, for a relative symlink it will relock the
directory for the next lookup and conceivably get inconsistent
results. Things could be adjusted to hold the lock at the cost of
additional complexity. However, I'm not convinced this case matters
that much, and I think the long-term solution to this and other
problems, like the one I alluded to above in link(), and the issue
with stat(), is to add an additional layer of timestamp-style
synchronization to vnodes; this would allow killing and restarting
*any* inconsistent operation, including those involving lots of "."
and "..". (If you don't know how timestamp synchronization works, I
can explain; or check a database textbook.)

 - In the above code, namei_follow consumes obj_vp. This is maybe
wrong, and maybe not; one could argue it either way, because on the
one hand what namei_follow *does* is effectively consume the fact that
we've hit a link and read it into the path buffer, but on the other
hand having random functions consume references is usually not a good
idea. I will think about this some more

 - Note that all VOP_LOOKUP is passed in this world is the directory
and the name string to look up. I'm aware that e.g. ffs needs to be
able to store an off_t for later use by VOP_CREATE. Arranging this
will cause some additional ugliness but I don't think it'll be that
bad. (It can legitimately be stashed inside the pathbuf structure...)

 - The pathbuf operations are:

        pathbuf_parsename - inspect path and identify next component
        pathbuf_at_last_name - is current component the last one?
        pathbuf_get_current_name - retrieve the current component as a string
        pathbuf_consume_name - drop current component, prepare for next
        pathbuf_use_root - check for leading slashes and consume them
        pathbuf_note_followed - count number of symlinks crossed
        pathbuf_get_symlink_space - see comment in code
        pathbuf_accept_symlink - see comment in code

plus (not used above)

        pathbuf_strdup - create pathbuf from kernel-space string
        pathbuf_copyin - create pathbuf from userspace string
        pathbuf_destroy - clean up pathbuf

None of these names are yet meant to be cast in stone.

 - In the long run it may be desirable to pass dir_vp to
pathbuf_parsename, so that inspecting the beginning of a pathname and
choosing how much of it to lop off and send to VOP_LOOKUP could be
made a vnode op on dir_vp. This would allow research fses that have
alternate naming models.

 - I should perhaps note that in this world the need to advise namei
of what operation you are (lookup vs. mkdir or whatever) is gone, and
most of the namei flags can be dumped. Certainly the gross ones like
SAVESTART will go away; I *think* all we need is FOLLOW/NOFOLLOW and
TRYEMULROOT. I'm not completely certain of NOCACHE and the
whiteout-related flags.

 - I haven't decided exactly how to expose the remaining flags. They
could be an argument to the pathbuf creation functions; there could be
a pathbuf_set_flags call; or there could be individual pathbuf_set
calls for each of the flags. If there are only a few the latter might
be best. The primary criteria are that I'd like to make it so random
passersby don't have to know which way the defaults are, and to avoid
having code where silly mistakes aren't caught by the compiler. (Thus,
definitely nothing like "pathbuf_set_options(pathbuf, 1, 0, 1, 0)",
although intermediate steps may have this form temporarily. See
below.)


This is how open() works in this world:

   vfs_open(pathbuf, flags, mode) {
        while (1) {
                dir_vp = namei_parent(pathbuf);
                obj_vp = namei_once(dir_vp, pathbuf /* with special flag */);
                if (ENOENT && (flags & O_CREAT)) {
                        check_write_permission(dir_vp);
                        obj_vp = VOP_CREATE(dir_vp,
                                        pathbuf_get_current_name(pathbuf),
                                        _S_IFREG, mode);
                        vn_unlock(dir_vp);
                        vn_decref(dir_vp);
                        break;
                } else if (ENOENT) {
                        /* was left locked because of special flag */
                        vn_unlock(dir_vp);
                        vn_decref(dir_vp);
                        ENOENT;
                } else if (flags & (O_CREAT|O_EXCL)) {
                        vn_decref(dir_vp);
                        EEXIST;
                } else {
                        if (is_link(obj_vp)) {
                                dir_vp = namei_follow(pathbuf, dir_vp, obj_vp);
                                continue;
                        } else {
                                check_access_permission(obj_vp, flags);
                                vn_decref(dir_vp);
                                break;
                        }
                }
                /* NOTREACHED */
        } /* end of while */
        /* now have the desired vnode in obj_vp */

        /* ... do stuff like O_TRUNC here */

        return obj_vp;
   }

If anyone would like to look at the actual code from my research
system, I can post it. (pooka@ already has a copy.)

And now...


How I'm intending to get there
------------------------------

I could just sit down and type up the previous in full detail, then
just switch everything over. (I've seriously considered this. It would
be less work.) But since there are several features I haven't yet
designed in support for, I don't think this is a good idea. Plus it
would be a very big change all at once (including a lot of code in
every fs, because I'm changing quite a few of the VOPs), it's unlikely
I would have much confidence in the results, and it would take a long
time to tidy up all the fallout.

I've also thought about doing that, but keeping the old VOPs and
switching system calls over from the old code to the new one at a
time. Then after everything is switched, the old VOPs could all be
g/c'd. This would be less intrusive, but it could introduce some
really weird inconsistencies. Also, struct componentname is exposed to
a lot of VOPs and cloning all of those temporarily would make a pretty
big mess. So I think this is actually a worse idea than switching
everything over all at once.

So I think the best thing to do is make it incremental. Very
incremental. (Note that I've already set out a couple times just to
switch callers over to some aspects of the intended new interface, and
my sense for code cleanup has told me I needed to stop because I was
changing too much too fast.)

The first step, I think, is to weed out all the simple callers of
namei. By this I mean those that are LOOKUP, use no flags other than
FOLLOW/NOFOLLOW and TRYEMULROOT, and don't do anything afterwards
besides extract the result vnode. These can be converted to a simpler
form: namei_simple_kern() and namei_simple_user(), one for kernel
pointers and one for userspace pointers. These functions as I've
currently set them up take four arguments: the path, *separate*
boolean (zero/nonzero) arguments for FOLLOW and TRYEMULROOT, and a
return pointer for the result vnode. These calls then won't have to be
touched for quite a while as other stuff changes under them.

(Note that using zero/nonzero arguments for FOLLOW and TRYEMULROOT
rather than the flags is intentional. Even though as it's possible to
accidentally use the wrong order, I've made all the calls use the
forms [01]/*(NO|)FOLLOW*/, [01]/*(NO|)TRYEMULROOT*/ so they can be
pretty readily checked. This way these calls are insulated against
*all* changes in the namei interface, including possible
rearrangements of what the flags mean. The idea is that after
everything else is done I'll go through and convert them all to the
new form of namei, or some suitable alternate form of namei_simple
with a better flags interface.)

I have done this much already (but not committed it) and it takes care
of just about half the namei call sites in the kernel; so it's a
significant step forward.

There are quite a few more calls that are still the simple form except
they also use LOCKLEAF. I think the second step is to change these to
call namei_simple_* and then vn_lock().

This much I don't think should be done on a branch - the changes are,
while widespread, not that invasive and fairly easily validated by
inspection. Plus, if I break something it is best found as soon as
possible; and if I break something it'll probably be something
obscure, in which case if it's on a branch it'll never be found anyway
until the branch is merged, at which point there'll be many more
potential culprits.

After this point it becomes less clear what the next step is. I think
that since the members of nameidata and componentname are basically
functioning as global variables, the way to proceed is to clean them
up the way one would clean up global variables: rename them or move
them into abstractions one by one and see what stops compiling. Doing
this gradually will create churn, but it'll also offer confidence that
nothing's gone seriously wrong.

The real question at this point is how much I want to try to get into
5.0. It would be very desirable to get it *all* into 5.0, because even
if it still needs a shakedown when the branch point arrives, it'll be
a lot more robust and maintainable than what we have now. However,
it's now the middle of March and in the past four months I haven't had
that much time to work on it and haven't really made all that much
progress. While this is partly because I've had to start over several
times because I was getting in too deep too fast, it's not clear that
trying to be done before the 5.0 branch is reasonable. And it would
definitely not be a good idea for 5.0's namei to be frozen into some
random intermediate state that might or might not have fundamental
self-consistency problems.

I think the best plan for now is to go ahead and commit the first two
steps described above to HEAD; they will not have any serious adverse
effects on 5.0 even if nothing else gets in. Then I think I will hack
for a while on a local branch, and if I'm making rapid progress or
discover the existence of a good intermediate state, I can start
feeding things to HEAD one bit at a time. Otherwise, I can sit on it
locally until after the 5.0 branch point.

I don't think there's any real point in creating a CVS branch for
this; I'm assuming nobody else is likely to want to hack on it, and
readonly access for testing can be arranged easily enough with
Mercurial or as patches. I'm not convinced that having it available
from CVS for testing would get it any more testing; plus, this way I
can realistically feed it to HEAD in small pieces, which should
minimize or at least spread out any fallout.

Anyone who's maintaining an out-of-tree FS may want to get in touch
with me. If you're doing unclean things with componentnames I can't
promise to continue to support whatever it is you're trying to
accomplish; but if you're doing something reasonable but unusual that
I might overlook I'd rather hear about it now than six months down the
road.

-- 
David A. Holland
dholland%netbsd.org@localhost
Follow-Ups:
- Re: Cleaning up namei
  - From: Greg Troxel
- Re: Cleaning up namei
  - From: Bill Stouder-Studenmund
Prev by Date: Re: Replace lockmgr for vnodes
Next by Date: Re: SPL NOT ZERO ON SYSCALL ENTRY
Previous by Thread: Pools vs kmem_alloc [Was: CVS commit: src/sys/kern]
Next by Thread: Re: Cleaning up namei
Indexes:
Home | Main Index | Thread Index | Old Index