Subject: Re: mkdir with trailing / (patch proposed)
To: None <tech-userlevel@netbsd.org, tech-kern@netbsd.org>
From: Greg A. Woods <woods@weird.com>
List: tech-kern
Date: 05/06/2002 01:31:35
[ On Sunday, May 5, 2002 at 23:39:26 (-0400), der Mouse wrote: ]
> Subject: Re: mkdir with trailing / (patch proposed)
>
> I still don't see what the problem is.  That's exactly what I always
> thought they already did ("always" = until this discussion); none of
> the cases where the differences matter had shown up on my radar.

The only thing that's a directory _is_ a directory file.  A directory is
just a file.  A file with a special bit set.  It doesn't contain user
data per se, but rather names and pointers which are used by the
pathname translation algorithms to implement the user level hierarchical
view of the filesystem.  The hierarchy is created by arranging those
files into a path of filename components, where all but the last
component must be a file that is a directory.

In user-land, and in the kernel interfaces, the filename components are
usually concatenated into a single string and they are _separated_ by
slashes.  The slash is meaningless in and of itself though.  A pathname
could be a vector of character arrays, either with length specifiers or
NUL terminators, or even slash terminators.  You can have any number of
terminators after every pathname component (so long as the appropriate
storage has been allocated to hold them).  That's why trailing slashes
are really meaningless too.  The last component of a pathname is only a
directory filename if you expect it to be in the context where you
express the pathname.
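
To make that concrete, here is a minimal sketch in C of a pathname treated
as nothing more than a list of components.  The names next_component() and
count_components() are hypothetical, invented for this illustration and not
taken from any kernel.  Any run of slashes is skipped as a separator or
terminator, so "usr/src/sys" and "usr//src/sys///" yield the same three
components.  The leading slash of an absolute pathname is skipped here
too; choosing the starting directory is a separate step.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from any kernel): copy the next filename
 * component of *pathp into buf, treating any run of slashes purely as
 * a separator or terminator.  Returns 1 if a component was found,
 * 0 when the pathname is exhausted, -1 if the component does not fit.
 */
static int
next_component(const char **pathp, char *buf, size_t buflen)
{
	const char *p = *pathp;
	size_t len;

	while (*p == '/')		/* skip leading or repeated slashes */
		p++;
	if (*p == '\0') {
		*pathp = p;
		return 0;		/* only terminators were left */
	}
	len = strcspn(p, "/");		/* length up to next slash or NUL */
	if (len >= buflen)
		return -1;
	memcpy(buf, p, len);
	buf[len] = '\0';
	*pathp = p + len;
	return 1;
}

/* Count the components in a pathname; slashes themselves never count. */
static int
count_components(const char *path)
{
	char name[256];
	int n = 0, r;

	while ((r = next_component(&path, name, sizeof(name))) == 1)
		n++;
	return r < 0 ? -1 : n;
}
```

With this reading a trailing slash adds nothing: there is simply no
further component after it.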

The only exception is for the root directory.  It is often said that
its name _is_ "/", and of course it is special because of the way
the concept of the current directory works in unix-like systems.  Of
course you could think of a fully qualified pathname as just a vector of
filename components which starts with an "empty" name.  Indeed that's
the way I've always looked at it (well, maybe not exactly always -- it
took some doing to undo the preconceptions I had picked up from a
superficial understanding of other systems such as Multics).

The namei algorithm described in Maurice Bach's book ("The Design of the
UNIX Operating System") makes this very clear since it doesn't say what
the separator is, or talk about leading and trailing separators or
anything like that.  It simply presents the abstract notion of a
pathname component and uses a pseudo-code test for whether there are
more components following the one currently being examined.  Even the
test to determine if the first component refers to the root directory
isn't spelled out.
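
The shape of such a component-by-component loop can be sketched in C over
a toy in-memory tree.  This is not Bach's pseudo-code and not any real
kernel's namei: the toy_inode and toy_dirent structures and the
toy_lookup() and toy_namei() functions are all hypothetical, and things
like permissions, mount points, "." and "..", and symbolic links are left
out.  The only tests are "does the pathname start at the root?" and "are
there more components?".

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAXENT 8

struct toy_inode;

/* One directory entry: a name and a pointer to the file it names. */
struct toy_dirent {
	const char *name;
	struct toy_inode *ino;
};

/* A toy file; a directory is just a file with is_dir set. */
struct toy_inode {
	int is_dir;
	int nent;
	struct toy_dirent ent[TOY_MAXENT];
};

/* Search one directory for a component of the given length. */
static struct toy_inode *
toy_lookup(struct toy_inode *dir, const char *name, size_t len)
{
	int i;

	for (i = 0; i < dir->nent; i++)
		if (strlen(dir->ent[i].name) == len &&
		    memcmp(dir->ent[i].name, name, len) == 0)
			return dir->ent[i].ino;
	return NULL;
}

/*
 * Translate a pathname to a toy inode, or NULL on failure.
 * Assumption of this sketch: an empty pathname names the starting
 * directory; real systems differ on that point.
 */
static struct toy_inode *
toy_namei(struct toy_inode *root, struct toy_inode *cwd, const char *path)
{
	struct toy_inode *ip = (*path == '/') ? root : cwd;
	size_t len;

	for (;;) {
		while (*path == '/')	/* separators and terminators */
			path++;
		if (*path == '\0')
			return ip;	/* no more components */
		if (!ip->is_dir)
			return NULL;	/* only a directory can be searched */
		len = strcspn(path, "/");
		ip = toy_lookup(ip, path, len);
		if (ip == NULL)
			return NULL;	/* no such component */
		path += len;
	}
}
```

Note that the directory check happens only when another component
actually follows, so in this sketch a trailing slash after the last
component changes nothing.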

> Basically, my wetware pathname interpretation engine treats a slash as
> "prepare for another component, path-so-far must name a directory",
> even if the slash is at the end of the string, and I'd assumed namei
> behaved similarly.  (This after compressing multiple slashes.)

If you really want to refer to a file explicitly as only a directory
then you should refer to it by its own unique filename:  "." (with the
appropriate pathname prefix, of course :-).  Of course if you're
expecting to create a directory then the last filename component in your
pathname isn't yet a directory, and in fact you're hoping it's not
anything at all yet.  As for rmdir(2), well you're not supposed to be
able to unlink the "." filename -- rmdir is the only special call in
that it expects the last component filename to be a directory file, but
you can't explicitly refer to it with its "." name.  It does the magic
of unlinking the "." and ".." files for you all in one atomic operation
so as to avoid any race conditions where the hierarchy would be
inconsistently represented in directory files.  If you've ever had a
crash in the middle of an rmdir you'll have found out that fsck has to
fix this inconsistency before the filesystem can be considered "clean".
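
The rmdir(2) restriction is easy to observe from user-land.  The sketch
below uses a hypothetical function, rmdir_dot_demo(), written only for
this illustration: it creates a directory, tries to remove it through its
"." name, and then removes it by its ordinary name.  POSIX specifies that
rmdir() fails when the final component is "." (the usual errno is EINVAL),
but the sketch checks only that the call fails, not which error comes
back.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demonstration: returns 0 if removing "dir/." was
 * refused and removing "dir" itself then succeeded, -1 otherwise.
 * The directory named by dir must not already exist.
 */
static int
rmdir_dot_demo(const char *dir)
{
	char dot[1024];
	int n;

	if (mkdir(dir, 0700) == -1)
		return -1;

	/* Build "dir/." so the last component is the "." name. */
	n = snprintf(dot, sizeof(dot), "%s/.", dir);
	if (n < 0 || (size_t)n >= sizeof(dot)) {
		(void)rmdir(dir);
		return -1;
	}

	if (rmdir(dot) == 0)
		return -1;	/* unexpected: "." could be removed */

	/* The ordinary name works, and the directory is gone afterwards. */
	return rmdir(dir);
}
```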

> The only arguments I've seen advanced against this interpretation are
> 
> (1) Buh-but that's not what $ANCIENT_UNIX_VERSION did!

Not just the ancient versions.  _ALL_ versions of The UNIX Operating
System (tm) from the very earliest we have source code for to whatever
any commercial vendor ships tomorrow.  That's all versions from pretty
much the fabled Epoch right up until some time in the future when/if the
vendors are convinced the new standards are worth the headache.

It's not just AT&T's legacy of code either....

From what I'm given to understand this also includes many/all(?) Linux
versions.  (I should check their source code, but I'm too lazy.... :-)

I'm pretty sure it includes MINIX too (yes, there it is on line 10759 in
the original, and still the same in the current release too), and I
think it was true of Tunis (though there we go with ancient code
again...  :-).

*BSD is very alone here, and until someone pulled enough weight on the
right IEEE committee, even the published standards were against it.

I only point out the ancient Fifth and Sixth Edition code and its
related documentation to show how fundamental and central this concept
is to the very core of Unix pathname translation.  Pathname translation
is extremely important to Unix and unix-like systems.  As Lions pointed
out it's one of the key algorithms in the entire kernel.

> (2) There exist applications that assume other semantics.
> 
> Of those, (1) is irrelevant, in my opinion; I'd call that a bug in
> $ANCIENT_UNIX_VERSION and would greet the change as I would any bugfix.
> (2) is more serious, but as the Rationale you quoted points out, an
> application cannot count on any existing behaviour to be portable now.

Actually, except for *BSD, they can.  As I've shown, slashes are
treated only as pathname component separators and/or terminators in
every other implementation I'm aware of.

Even in *BSD:  if you go by the description of pathname translation on
p. 222 of "The Design and Implementation of the 4.4BSD Operating System",
very much the same algorithm as described by Bach is presented again,
with no discussion of the string representation of a pathname -- only
its existence as a list of components, and a test to determine whether
the first given component is the root directory.

WOW!  Magical co-incidences!  Pathname translation is also described
(in the same way) on page 222 of "UNIX Internals" by Vahalia.

So it seems it's only modern post-AT&T *BSD that's diverged somehow....

What's scary about the divergence in *BSD is that it has required
user-level changes to core utilities, such as mkdir(1), so that they'll
behave in a way such that a user can specify a trailing slash on a
directory filename!
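
For illustration only, this is roughly the kind of user-level workaround
that implies.  The helper strip_trailing_slashes() is hypothetical; it is
not the code from any actual mkdir(1), and it assumes the utility simply
trims the operand before handing it to mkdir(2).  A pathname consisting
only of slashes is left as "/" so the root directory is still named.

```c
#include <string.h>

/*
 * Hypothetical helper: remove trailing slashes from path in place,
 * keeping a lone "/" intact.  Returns path for convenience.
 */
static char *
strip_trailing_slashes(char *path)
{
	size_t len = strlen(path);

	while (len > 1 && path[len - 1] == '/')
		path[--len] = '\0';
	return path;
}
```

A utility would call this on each operand before the system call, e.g.
mkdir(strip_trailing_slashes(operand), mode).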

As for IEEE POSIX 1003.1-2001's new rule, well there's only more
confusion there for sure!

-- 
								Greg A. Woods

+1 416 218-0098;  <gwoods@acm.org>;  <g.a.woods@ieee.org>;  <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>