tech-kern archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: Cleaning up namei

On Wed, Mar 19, 2008 at 02:12:36PM -0700, Bill Stouder-Studenmund wrote:
 > > Anyhow. I have written and tested this VFS design in a research kernel
 > > and to the best of my knowledge it handles all the basic cases
 > > correctly, including things like create-follows-symlinks. It is not
 > > yet clear exactly how much extra widgetry will be needed to support
 > > things my research kernel didn't have, like TRYEMULROOT and
 > > MOUNT_UNION, whiteouts, and compat code that does odd things.
 > I don't think they're that hard to add. Please leave them in. ;-)

Don't worry, the whole point is that I'm not intending to remove those
things. Otherwise it would be a lot easier all around to just unscrew
the radiator cap and drive a new vfs underneath.

 > >    namei(pathbuf) {
 > >    dir_vp = NULL;
 > I think you'll have more success making "start from the top" be an 
 > explicit action. Emulation becomes easier because you could try "under 
 > emulation" then  "not under emulation". Doing it all in namei_begin() can 
 > make state-keeping challenging.

Usage of TRYEMULROOT is common enough that I'd rather have it rolled
inside namei. It is also enough like unionfs that it scares me, even
though its behavior is allegedly simple.

What are the semantics of TRYEMULROOT and open with O_EXCL supposed to
be, anyway? (Is this even well defined right now? I suspect not.)

 > >  - I should perhaps note that in this world the need to advise namei
 > > of what operation you are (lookup vs. mkdir or whatever) is gone, and
 > > most of the namei flags can be dumped. Certainly the gross ones like
 > > SAVESTART will go away; I *think* all we need is FOLLOW/NOFOLLOW and
 > > TRYEMULROOT. I'm not completely certain of NOCACHE and the
 > > whiteout-related flags.
 > As above, please keep the whiteout-related ones. NOCACHE could be useful, 
 > I'm not fully sure how we use it now. The one advantage to keeping the 
 > operation is it tells us how we want to cache results. But you can make 
 > NOCACHE two flags, one for "Don't cache hits" and one for "don't cache 
 > misses".

Well, I'm not sure if they'll be needed any more. If they are, they'll
stick around. If they aren't... they won't. (And if I get rid of
NOCACHE and next year we find we want it again, it's easily brought

 > > This is how open() works in this world:
 > > 
 > >    vfs_open(pathbuf, flags, mode) {
 > >    while (1) {
 > >            dir_vp = namei_parent(pathbuf);
 > >            obj_vp = namei_once(dir_vp, pathbuf /* with special flag */);
 > Hmmm.... I don't like the idea of namei_once() being external to namei.
 > I think what would work better would be to make VOP_CREATE return an error
 > code (as it does now) and have &obj_vp be a parameter. And have create 
 > return the vnode if it finds an existing one. i.e. if VOP_CREATE() returns 
 > EEXIST, then &obj_vp contains a (referenced) pointer to the found vnode.
 > Also, if O_CREATE isn't there, just do a namei().

After quite a few rounds of MSG_OOB I believe this is what the
alternate proposal is, using the same notation:

vfs_open(pathbuf, flags, mode) {
        if (flags & O_CREAT) {
                dir_vp = namei_parent(pathbuf);
                while (1) {
                        obj_vp = VOP_CREATE(dir_vp,
                                pathbuf_get_current_name(pathbuf), mode);
                        if (EEXIST) {
                                if (flags & O_EXCL) {
                                } else if (is_link(obj_vp)) {
                                        dir_vp = namei_follow(pathbuf,
                                                dir_vp, obj_vp);
                                } else {
        } else {
                obj_vp = namei(pathbuf);

        /* ... do stuff like O_TRUNC here */
        return obj_vp;

This has the advantage that it doesn't need to lock the vnode from
fs-independent code; it has the corresponding disadvantage that the
correctness of O_EXCL depends on every fs doing correct locking in its

It also requires slightly more magic, in that the obj_vp sent back
from VOP_CREATE must be valid even on (some) errors, which definitely
violates the principle of least surprise, but doesn't require exposing
namei_once(). My method requires namei_once(), but the only magic it
needs is a special on-error locking behavior triggered explicitly by a

I think they're both correct, so it's a matter of interface

Note that namei_once may be required by compat code though, so it may
still need to be exposed, just without the magic flag. There's at
least one compat version of lstat that needs to examine the parent
directory vnode of a symlink in order to inherit permissions, or some
such glop, and to do that you need to do namei_parent to get the dir
followed by namei_once to get the link.

 > > This much I don't think should be done on a branch - the changes are,
 > > while widespread, not that invasive and fairly easily validated by
 > > inspection. Plus, if I break something it is best found as soon as
 > > possible; and if I break something it'll probably be something
 > > obscure, in which case if it's on a branch it'll never be found anyway
 > > until the branch is merged, at which point there'll be many more
 > > potential culprits.
 > I think doing the above in head sounds fine.

Good :-)

 > One thing that would be good to support is one of the tricks that HFS will 
 > pull, which is that to open a named stream, you add "/streamname" to the 
 > end of a name. I think the only thing we need to support for this is that 
 > VOP_LOOKUP will tell us, "This was the last name component, even though 
 > you didn't think it was." I don't think this will impact namei_parent() 
 > though, as you don't "create" these forks, you add them to existing files.

Hmm. This is not entirely trivial, because for rename (and mkdir,
etc.) you need to be able to tell which is the last component before
trying to look it up.

But then again, for nearly all purposes, the fork name *is* the last
component, and the file holding the streams *is* effectively a
directory. So I suspect it'll all just work given implementations for
the directory-like VOPs on multistream regular files. Provided the
VFS-level code doesn't check the vnode type before calling directory
ops, but in general it shouldn't be doing that - vnode types are all
very well but they're not necessarily authoritative.

(I once, long ago, tried to write a FS with polymorphic objects and
more or less had to give up because there were too many assumptions in
fs-independent code. That was Linux though.)

David A. Holland

Home | Main Index | Thread Index | Old Index