On Wed, Mar 19, 2008 at 05:37:09AM +0000, David Holland wrote: > As some of you may know or have heard, I have been loitering with > intent near the namei code lately. It is a horrible mess, it is > causing fairly serious problems by being a mess, and I want it to > stop. > > > What I'm intending to do/where I'm going > ---------------------------------------- > > My basic intent is to abolish struct nameidata and struct > componentname entirely. Right now, the contents of these structures > function more or less like global variables; this is the primary > reason all of the namei-related code is so dodgy. Sounds good. > I am also going to be pushing work up out of namei to the callers; > another major problem is that namei tries to do everything for > everybody, and so pieces of the logic of various operations are > embedded in it when they really ought to be separate. This may lead to > some code duplication in the short run. (In the long run I'd like to > create a family of vfs_* and vn_* functions that provide all the file > ops, thus drawing a distinction between the vfs code and the system > call layer; this way compat code and nfsd and other things can sit on > top of vfs much more cleanly. But this isn't happening right away.) > > The basic model I have in mind is that there will be two main calls: > namei() and namei_parent(). The former will go path -> vnode; the > latter will go path -> directory vnode and last name component, which > is the basic building block for open/creat, mkdir, rmdir, link, > unlink, rename, and so forth. There will also be sub-operations > namei_begin(), namei_once(), and namei_follow(). > > I am also intending to introduce a struct pathbuf, which is an > abstraction to hold a pathname - the primary value of this is that > because it is known to always be size PATH_MAX the namei code can > reuse it to follow symlinks, which turns out to simplify allocation a > good deal. But it also makes the copyin() code for pathnames more > stylized, and thus easier to both crosscheck and, whenever we're ready > to do so, generate automatically. > > (I discovered the value of struct pathbuf accidentally while hacking a > research kernel; I was tinkering with namespaces and wanted to > abstract away the structure of a filename so it could be a string or a > namespace id and a string; the research proved a dead end, but > refitting my vfs code to use struct pathbuf in order to support it > turned out to make it a good bit cleaner. So I think it's a good idea > in general.) > > The other major change I intend is that I'm going to simplify the > locking. While namei will hold the lock of a directory while looking > in it (which is necessary), it will not hold more than one lock at a > time and it will never (with one minor exception I'll discuss later) > return a locked vnode. > > This specifically abolishes the traditional lock coupling that we (and > ~everyone else) uses while searching down the tree. There is no need > for it: all it prevents is, if two processes are following the same > path, one catching up to and passing the other on the way down. In > general this is harmless, because lookup is a read operation and read > operations commute freely with one another. I'm not sure I like this. Well, I don't mind namei_parent() returning an unlocked node. I however think it would be better to keep the lock-walking we do now and just unlock at the end. My understanding was that when we replaced lockmgr locks for vnodes, we used reader-writer locks. If I understand correctly, we could use reader locks while walking down. I think retaining the same locking behaviors for the most part will reduce the chances of problems in the transition to the new scheme. So I recommend it. :-) > It is not quite as clear that it's harmless if it happens at the > bottom of a path, when one or both of the operations in progress are > directory writes of some kind (e.g., mkdir and rename) - but in this > world these cases all do namei_parent() and then explicitly lock the > parent directory until they're done, so the following operation has to > wait until the first completes. > > So far the only case I can think of that even might pose a problem is > that it becomes possible to stat() a pathname, succeed, and get back a > struct stat that says the link count is 0. I am not convinced that > this is a serious problem, but if it is, we can easily hack around > it. (However, I have discovered some other races that the lock > coupling doesn't prevent; e.g. you can get formally inconsistent > results out of link() if the timing is just right. And of course > exact consistency is a lost cause for paths that involve "..".) > > (I have been thinking about preparing a formal model and a proof that > the lock coupling isn't necessary, because the supposed need for it is > a long-standing piece of kernel lore and therefore lots of people are > skeptical. But so far I haven't, partly because it's work and partly > because the existence of other races like the one in link will make > proving anything useful difficult.) You're talking about a wide-sweeping change to the kernel. I suggest you leave the lock stepping as-is for the most part just to improve the chances of a successful transition. If you can make this proof, you can always change the locking later. > Anyhow. I have written and tested this VFS design in a research kernel > and to the best of my knowledge it handles all the basic cases > correctly, including things like create-follows-symlinks. It is not > yet clear exactly how much extra widgetry will be needed to support > things my research kernel didn't have, like TRYEMULROOT and > MOUNT_UNION, whiteouts, and compat code that does odd things. I don't think they're that hard to add. Please leave them in. ;-) > However, here's what the basic parts look like in pseudocode without > error handling. Note that pathbuf is stateful and also holds the flags > that pertain to the interpretation of its path. There's a slightly > different formulation where it also holds dir_vp and obj_vp, which > makes it a lot more like nameidata - even in that case it's still > a sealed abstraction, but I think I like it this way better. > > namei(pathbuf) { > dir_vp = NULL; I think you'll have more success making "start from the top" be an explicit action. Emulation becomes easier because you could try "under emulation" then "not under emulation". Doing it all in namei_begin() can make state-keeping challenging. > while (1) { > dir_vp = namei_parent(dir_vp, pathbuf); > obj_vp = namei_once(dir_vp, pathbuf); > if (is_link(obj_vp) && !is_nofollow(pathbuf)) { > dir_vp = namei_follow(pathbuf, dir_vp, obj_vp); > continue; > } > break; > } > return obj_vp; > } > > namei_parent(dir_vp, pathbuf) { > dir_vp = namei_begin(dir_vp, pathbuf); > while (1) { > pathbuf_parsename(pathbuf); > if (pathbuf_at_last_name(pathbuf)) { > break; > } > obj_vp = namei_once(dir_vp, pathbuf); > if (is_link(obj_vp)) { > dir_vp = namei_follow(pathbuf, dir_vp, obj_vp); > vn_decref(obj_vp); /* ? */ > } else { > pathbuf_consume_name(pathbuf); > vn_decref(dir_vp); > dir_vp = obj_vp; > } > } > return dir_vp; > } > > namei_once(dir_vp, pathbuf) { > if (!is_dir(dir_vp)) { > ENOTDIR; > } > vn_lock(dir_vp); > check_search_permission(dir_vp); > obj_vp = VOP_LOOKUP(dir_vp, pathbuf_get_current_name(pathbuf)); > vn_unlock(dir_vp); > return obj_vp; > } > > namei_begin(dir_vp, pathbuf) { > if (dir_vp == NULL) { > if (pathbuf_use_root(pathbuf)) { > dir_vp = root_vp; > } else { > dir_vp = curproc->p_cwd; /* or whatever */ > } > vn_incref(dir_vp); > } > return dir_vp; > } > > namei_follow(pathbuf, dir_vp, obj_vp) { > if (pathbuf_note_followed(pathbuf)) { > ELOOP; > } > /* > * Any remaining path in pathbuf gets shuffled to the end. > * The link contents are read into the beginning. Then the > * part at the end is moved backwards over any leftover space > * in between. If the link doesn't fit, we ENAMETOOLONG. This > * avoids ever having to allocate during namei. (It won't be > * quite so easy with TRYEMULROOT, but it's still far simpler > * than the status quo.) > */ > (buf, len) = pathbuf_get_symlink_space(pathbuf); > len = VOP_READLINK(obj_vp, buf, len); > pathbuf_accept_symlink(pathbuf, len); > vn_decref(obj_vp); > if (pathbuf_use_root(pathbuf)) { > vn_decref(dir_vp); > vn_incref(root_vp); > return root_vp; > } > return dir_vp; > } > > vn_incref and vn_decref are basically vget and vrele, in case that > wasn't clear. Likewise, vn_unlock does the obvious thing. > > Some points to note: > > - There's one complication that I've left out of the pseudocode: > namei_once needs a flag that tells it to leave dir_vp locked if > VOP_LOOKUP gives ENOENT. This is used by open to make O_EXCL work. > This is the only major wart in the interfaces. > > - The way this pseudocode is written, namei_parent is passed a > starting dir so namei can call it in a loop. The intended public > interface for code outside of the namei implementation won't allow > that. > > - This code does not hold the lock on dir_vp across reading a > symlink. Consequently, for a relative symlink it will relock the > directory for the next lookup and conceivably get inconsistent > results. Things could be adjusted to hold the lock at the cost of > additional complexity. However, I'm not convinced this case matters > that much, and I think the long-term solution to this and other > problems, like the one I alluded to above in link(), and the issue > with stat(), is to add an additional layer of timestamp-style > synchronization to vnodes; this would allow killing and restarting > *any* inconsistent operation, including those involving lots of "." > and "..". (If you don't know how timestamp synchronization works, I > can explain; or check a database textbook.) I was about to suggest something that I think is about the same. My suggestion is that we add a counter to a struct vnode and if it's a directory, we incirment that counter each time we do something that changes its name space. Like create, rename, or delete of something in it. My suggestion is to add it as part of this work. Not everything has to USE it yet, but make it part of the new name buffering. Among other things, this will make the "relookup" stuff done in rename much easier. If the counters (current and saved) are the same, just keep going. If they differ, restart the whole thing. > - In the above code, namei_follow consumes obj_vp. This is maybe > wrong, and maybe not; one could argue it either way, because on the > one hand what namei_follow *does* is effectively consume the fact that > we've hit a link and read it into the path buffer, but on the other > hand having random functions consume references is usually not a good > idea. I will think about this some more > > - Note that all VOP_LOOKUP is passed in this world is the directory > and the name string to look up. I'm aware that e.g. ffs needs to be > able to store an off_t for later use by VOP_CREATE. Arranging this > will cause some additional ugliness but I don't think it'll be that > bad. (It can legitimately be stashed inside the pathbuf structure...) Yep, as above. Store both an off_t and a vnode change counter. > - The pathbuf operations are: > > pathbuf_parsename - inspect path and identify next component > pathbuf_at_last_name - is current component the last one? > pathbuf_get_current_name - retrieve the current component as a string > pathbuf_consume_name - drop current component, prepare for next > pathbuf_use_root - check for leading slashes and consume them > pathbuf_note_followed - count number of symlinks crossed > pathbuf_get_symlink_space - see comment in code > pathbuf_accept_symlink - see comment in code > > plus (not used above) > > pathbuf_strdup - create pathbuf from kernel-space string > pathbuf_copyin - create pathbuf from userspace string > pathbuf_destroy - clean up pathbuf > > None of these names are yet meant to be cast in stone. > > - In the long run it may be desirable to pass dir_vp to > pathbuf_parsename, so that inspecting the beginning of a pathname and > choosing how much of it to lop off and send to VOP_LOOKUP could be > made a vnode op on dir_vp. This would allow research fses that have > alternate naming models. > > - I should perhaps note that in this world the need to advise namei > of what operation you are (lookup vs. mkdir or whatever) is gone, and > most of the namei flags can be dumped. Certainly the gross ones like > SAVESTART will go away; I *think* all we need is FOLLOW/NOFOLLOW and > TRYEMULROOT. I'm not completely certain of NOCACHE and the > whiteout-related flags. As above, please keep the whiteout-related ones. NOCACHE could be useful, I'm not fully sure how we use it now. The one advantage to keeping the operation is it tells us how we want to cache results. But you can make NOCACHE two flags, one for "Don't cache hits" and one for "don't cache misses". > - I haven't decided exactly how to expose the remaining flags. They > could be an argument to the pathbuf creation functions; there could be > a pathbuf_set_flags call; or there could be individual pathbuf_set > calls for each of the flags. If there are only a few the latter might > be best. The primary criteria are that I'd like to make it so random > passersby don't have to know which way the defaults are, and to avoid > having code where silly mistakes aren't caught by the compiler. (Thus, > definitely nothing like "pathbuf_set_options(pathbuf, 1, 0, 1, 0)", > although intermediate steps may have this form temporarily. See > below.) > > > This is how open() works in this world: > > vfs_open(pathbuf, flags, mode) { > while (1) { > dir_vp = namei_parent(pathbuf); > obj_vp = namei_once(dir_vp, pathbuf /* with special flag */); Hmmm.... I don't like the idea of namei_once() being external to namei. I think what would work better would be to make VOP_CREATE return an error code (as it does now) and have &obj_vp be a parameter. And have create return the vnode if it finds an existing one. i.e. if VOP_CREATE() returns EEXIST, then &obj_vp contains a (referenced) pointer to the found vnode. Also, if O_CREATE isn't there, just do a namei(). > if (ENOENT && (flags & O_CREAT)) { > check_write_permission(dir_vp); > obj_vp = VOP_CREATE(dir_vp, > pathbuf_get_current_name(pathbuf), > _S_IFREG, mode); > vn_unlock(dir_vp); > vn_decref(dir_vp); > break; > } else if (ENOENT) { > /* was left locked because of special flag */ > vn_unlock(dir_vp); > vn_decref(dir_vp); > ENOENT; > } else if (flags & (O_CREAT|O_EXCL)) { > vn_decref(dir_vp); > EEXIST; > } else { > if (is_link(obj_vp)) { > dir_vp = namei_follow(pathbuf, dir_vp, obj_vp); > continue; Hmmm.... Ick on the need for namei_follow() here. > } else { > check_access_permission(obj_vp, flags); > vn_decref(dir_vp); > break; > } > } > /* NOTREACHED */ > } /* end of while */ > /* now have the desired vnode in obj_vp */ > > /* ... do stuff like O_TRUNC here */ > > return obj_vp; > } > > If anyone would like to look at the actual code from my research > system, I can post it. (pooka@ already has a copy.) > > And now... > > > How I'm intending to get there > ------------------------------ > > I could just sit down and type up the previous in full detail, then > just switch everything over. (I've seriously considered this. It would > be less work.) But since there are several features I haven't yet > designed in support for, I don't think this is a good idea. Plus it > would be a very big change all at once (including a lot of code in > every fs, because I'm changing quite a few of the VOPs), it's unlikely > I would have much confidence in the results, and it would take a long > time to tidy up all the fallout. > > I've also thought about doing that, but keeping the old VOPs and > switching system calls over from the old code to the new one at a > time. Then after everything is switched, the old VOPs could all be > g/c'd. This would be less intrusive, but it could introduce some > really weird inconsistencies. Also, struct componentname is exposed to > a lot of VOPs and cloning all of those temporarily would make a pretty > big mess. So I think this is actually a worse idea than switching > everything over all at once. > > So I think the best thing to do is make it incremental. Very > incremental. (Note that I've already set out a couple times just to > switch callers over to some aspects of the intended new interface, and > my sense for code cleanup has told me I needed to stop because I was > changing too much too fast.) > > The first step, I think, is to weed out all the simple callers of > namei. By this I mean those that are LOOKUP, use no flags other than > FOLLOW/NOFOLLOW and TRYEMULROOT, and don't do anything afterwards > besides extract the result vnode. These can be converted to a simpler > form: namei_simple_kern() and namei_simple_user(), one for kernel > pointers and one for userspace pointers. These functions as I've > currently set them up take four arguments: the path, *separate* > boolean (zero/nonzero) arguments for FOLLOW and TRYEMULROOT, and a > return pointer for the result vnode. These calls then won't have to be > touched for quite a while as other stuff changes under them. > > (Note that using zero/nonzero arguments for FOLLOW and TRYEMULROOT > rather than the flags is intentional. Even though as it's possible to > accidentally use the wrong order, I've made all the calls use the > forms [01]/*(NO|)FOLLOW*/, [01]/*(NO|)TRYEMULROOT*/ so they can be > pretty readily checked. This way these calls are insulated against > *all* changes in the namei interface, including possible > rearrangements of what the flags mean. The idea is that after > everything else is done I'll go through and convert them all to the > new form of namei, or some suitable alternate form of namei_simple > with a better flags interface.) I mildly prefer using flags, 'cause we'll probably need flags anyway. But I think changing over the easy ones first is good. > I have done this much already (but not committed it) and it takes care > of just about half the namei call sites in the kernel; so it's a > significant step forward. > > There are quite a few more calls that are still the simple form except > they also use LOCKLEAF. I think the second step is to change these to > call namei_simple_* and then vn_lock(). Sounds good. > This much I don't think should be done on a branch - the changes are, > while widespread, not that invasive and fairly easily validated by > inspection. Plus, if I break something it is best found as soon as > possible; and if I break something it'll probably be something > obscure, in which case if it's on a branch it'll never be found anyway > until the branch is merged, at which point there'll be many more > potential culprits. I think doing the above in head sounds fine. > After this point it becomes less clear what the next step is. I think > that since the members of nameidata and componentname are basically > functioning as global variables, the way to proceed is to clean them > up the way one would clean up global variables: rename them or move > them into abstractions one by one and see what stops compiling. Doing > this gradually will create churn, but it'll also offer confidence that > nothing's gone seriously wrong. > > The real question at this point is how much I want to try to get into > 5.0. It would be very desirable to get it *all* into 5.0, because even > if it still needs a shakedown when the branch point arrives, it'll be > a lot more robust and maintainable than what we have now. However, > it's now the middle of March and in the past four months I haven't had > that much time to work on it and haven't really made all that much > progress. While this is partly because I've had to start over several > times because I was getting in too deep too fast, it's not clear that > trying to be done before the 5.0 branch is reasonable. And it would > definitely not be a good idea for 5.0's namei to be frozen into some > random intermediate state that might or might not have fundamental > self-consistency problems. I'm not sure how to answer your question above. > I think the best plan for now is to go ahead and commit the first two > steps described above to HEAD; they will not have any serious adverse > effects on 5.0 even if nothing else gets in. Then I think I will hack > for a while on a local branch, and if I'm making rapid progress or > discover the existence of a good intermediate state, I can start > feeding things to HEAD one bit at a time. Otherwise, I can sit on it > locally until after the 5.0 branch point. > > I don't think there's any real point in creating a CVS branch for > this; I'm assuming nobody else is likely to want to hack on it, and > readonly access for testing can be arranged easily enough with > Mercurial or as patches. I'm not convinced that having it available > from CVS for testing would get it any more testing; plus, this way I > can realistically feed it to HEAD in small pieces, which should > minimize or at least spread out any fallout. > > Anyone who's maintaining an out-of-tree FS may want to get in touch > with me. If you're doing unclean things with componentnames I can't > promise to continue to support whatever it is you're trying to > accomplish; but if you're doing something reasonable but unusual that > I might overlook I'd rather hear about it now than six months down the > road. One thing that would be good to support is one of the tricks that HFS will pull, which is that to open a named stream, you add "/streamname" to the end of a name. I think the only thing we need to support for this is that VOP_LOOKUP will tell us, "This was the last name component, even though you didn't think it was." I don't think this will impact namei_parent() though, as you don't "create" these forks, you add them to existing files. Take care, Bill
Attachment:
pgpjJn87013ot.pgp
Description: PGP signature