tech-kern: Re: funlink() for fun!

Subject: Re: funlink() for fun!
To: Matthias Buelow <mkb@mukappabeta.de>
From: Greg A. Woods <woods@weird.com>
List: tech-kern
Date: 07/14/2003 14:30:25
[ On Monday, July 14, 2003 at 18:34:27 (+0200), Matthias Buelow wrote: ]
> Subject: Re: funlink() for fun!
>
> IMHO, the best solution (albeit outside the established Unix framework)
> would be to fully separate operations on directories and the flat file
> system (inodes/device-numbers or equivalents)...

Yes, "file handles" as some folks call them.  They are effectively
vnodes in the *BSD terminology.

>  There would be an
> operation, let's call it lookup() : pathname -> identifier, to
> translate a symbolic pathname into a more or less opaque identifier
> (similar to an fd)

Exposing vnodes to userland is effectively what getfh(2) already does.

>  Open(2) would then take this identifier to
> actually open the file, not a pathname.

We already have fhopen(2), fhstat(2), and fhstatfs(2).

Of course these calls (including getfh()) are currently restricted to
the superuser because they have not had the necessary ACL semantics
defined for them.

We are missing at least fhchdir(2), though for the superuser this can be
emulated with, for example (in pseudo-code):  fchdir(fhopen(getfh(".")))

> which is unique and reversible for both the
> referenced file and the directory entry during the period of its
> allocation to the process.
> 
>  The advantage would be that
> the application would have a handle on the actual directory entry,
> other than the volatile pathname.

You can't do that (directly) in any filesystem that has all of the same
semantics as a unix filesystem.

File names are just pointers to files, and those pointers are kept in
special files called directory files.  By convention one well-known
inode (inode #2 on FFS, not inode #0) is also the root directory for
the namespace we lay over top of the filesystem.  By convention the
first two entries in a directory file point to the directory file
itself and to the parent directory file.
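
The "." and ".." convention is easy to check from userland.  The
following is only a small sketch in Python (chosen because its os
module is a thin wrapper over the same system calls); the helper name
dot_entries() is made up for this illustration.

```python
import os
import tempfile

def dot_entries(directory):
    """Return the inode numbers of directory, directory/., directory/..,
    and the directory's parent (in that order)."""
    self_ino = os.stat(directory).st_ino
    dot_ino = os.stat(os.path.join(directory, ".")).st_ino
    dotdot_ino = os.stat(os.path.join(directory, "..")).st_ino
    parent = os.path.dirname(os.path.abspath(directory))
    parent_ino = os.stat(parent).st_ino
    return self_ino, dot_ino, dotdot_ino, parent_ino

with tempfile.TemporaryDirectory() as top:
    child = os.path.join(top, "child")
    os.mkdir(child)
    self_ino, dot_ino, dotdot_ino, parent_ino = dot_entries(child)
    print(self_ino == dot_ino)       # True: "." points at the directory itself
    print(dotdot_ino == parent_ino)  # True: ".." points at its parent
```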

However by convention we do not have the filename(s) recorded in the
files themselves and thus the only way to find the name for a file is to
traverse the directory structure until one encounters a name pointing to
the file in question.  Of course since a file may have more than one
name there's never any sure way to know if the name encountered is the
one the user had in mind for such a multi-named file.  Finding all the
names for a file is of course possible (especially since the link count
tells us how many to look for), but it still doesn't help decide which
was intended by the user.
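
The search described above can be sketched in userland roughly as
follows.  This is an illustration only, in Python; names_for_inode()
is a hypothetical helper, it only looks beneath the directory it is
given, and names that live elsewhere on the filesystem will be missed.

```python
import os
import tempfile

def names_for_inode(root, target):
    """Walk the tree under root and return every path whose
    (device, inode) pair matches the stat result in target."""
    want = (target.st_dev, target.st_ino)
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if (st.st_dev, st.st_ino) == want:
                found.append(path)
                if len(found) == target.st_nlink:
                    return found  # the link count says there are no more
    return found

with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, "sub"))
    first = os.path.join(top, "a")
    second = os.path.join(top, "sub", "b")
    with open(first, "w") as f:
        f.write("data\n")
    os.link(first, second)  # give the same inode a second name
    names = names_for_inode(top, os.stat(first))
    print(sorted(os.path.relpath(n, top) for n in names))  # ['a', 'sub/b']
```

Both names come back, but nothing in the result says which of them the
user opened the file by.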

Note that we don't want to try to record the filename(s) in the file
because there would be significantly more overhead and complication to
maintain those "reverse pointers", especially if you consider the number
of possible updates needed in a hierarchical filesystem for an operation
such as "mv /usr /user".  We also don't want to do this because we don't
want to have to allocate variable numbers of disk blocks for one
inode (which we would likely end up having to do sometimes if a file had
many names, even on filesystems with large blocks).

Once you begin down the path of designing a filesystem which has as its
major attributes a hierarchical naming scheme with directories and
sub-directories, and which allows multiple hard links to files, then you
must give up on the notion of having a guaranteed and fast way to
determine a file's name when all you have is a file handle, or file
descriptor, inode, or vnode, etc.  You can still find the name(s) for a
file when given one of those index pointers, but the time it will take
depends on the size of the namespace, and anyone who's run "find -inum"
on a very large filesystem will know this can be a very long time.  It's
not impossible -- it's just a lot more painful than we might desire,
especially when we highly desire to do it for something like funlink().

>  One could then use something like
> funlink() on that identifier to delete the directory entry and simulate
> unlink() without having to care for the case that a new entry with the
> same name has been established in the directory in the meantime.

fhunlink(2) faces the same problems as my funlink(2) does.  Inodes can
still have multiple names and their names can still be changed and moved
within the directory tree between the time the handle or descriptor was
obtained and the time the unlink is attempted.
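
The rename case is easy to reproduce with an ordinary file descriptor.
A minimal sketch in Python, where still_names() is a hypothetical
helper and the file names are arbitrary:

```python
import os
import tempfile

def still_names(fd, path):
    """True if path currently refers to the file open on fd."""
    try:
        st = os.lstat(path)
    except FileNotFoundError:
        return False
    held = os.fstat(fd)
    return (st.st_dev, st.st_ino) == (held.st_dev, held.st_ino)

with tempfile.TemporaryDirectory() as top:
    original = os.path.join(top, "report")
    moved = os.path.join(top, "report.old")
    with open(original, "w") as f:
        f.write("first\n")
    fd = os.open(original, os.O_RDONLY)
    try:
        os.rename(original, moved)        # someone moves the file away
        with open(original, "w") as f:    # and then reuses the old name
            f.write("second\n")
        print(still_names(fd, original))  # False: the name points elsewhere now
        print(still_names(fd, moved))     # True: same file, different name
    finally:
        os.close(fd)
```

An unlink of the original pathname at this point would remove the
wrong file, and the descriptor alone carries no record of the new name.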

> Unfortunately this poses several problems: some filesystems cannot
> easily produce such an indirection, it has to be emulated on them
> (should be feasible, though)

Unix filesystems, and indeed all hierarchical filesystems which support
multiple hard links and allow file rename operations, cannot produce any
such indirection without forcing great overhead on operations we
currently consider simple and fast for unix filesystems to implement.

The emulation of this indirection, and its possible optimizations, is
exactly what I've proposed for funlink(2) (and would be shared with
fhunlink(2)), and the way I've proposed it only impacts this new
operation, and not as far as I can tell any existing operation.
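
As a rough userland illustration only (the proposal itself is not
reproduced in this message, and the names funlink_emulated(),
search_root, and hint are hypothetical), such an emulation might try
the last known name first and fall back to a search by inode number:

```python
import os
import tempfile

def candidates(search_root, hint):
    """Yield the hinted name first, then every file name under search_root."""
    if hint is not None:
        yield hint  # cheap case: the name has not changed
    for dirpath, _dirnames, filenames in os.walk(search_root):
        for name in filenames:
            yield os.path.join(dirpath, name)

def funlink_emulated(fd, search_root, hint=None):
    """Remove one name of the file open on fd.
    Returns the path removed, or None if no name was found."""
    held = os.fstat(fd)
    for path in candidates(search_root, hint):
        try:
            st = os.lstat(path)
        except FileNotFoundError:
            continue
        if (st.st_dev, st.st_ino) == (held.st_dev, held.st_ino):
            os.unlink(path)  # not atomic with the lstat() above
            return path
    return None

with tempfile.TemporaryDirectory() as top:
    old = os.path.join(top, "spool")
    new = os.path.join(top, "spool.1")
    with open(old, "w") as f:
        f.write("x\n")
    fd = os.open(old, os.O_RDONLY)
    try:
        os.rename(old, new)  # the name changes behind our back
        removed = funlink_emulated(fd, top, hint=old)
        print(os.path.basename(removed))  # spool.1
        print(os.path.exists(new))        # False
    finally:
        os.close(fd)
```

In userland the check and the unlink remain two separate steps, so a
rename can still slip in between them, and the fallback search costs
time proportional to the size of the tree searched.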

> and it doesn't work with the current
> established Unix filesystem API.

We already have getfh(), fhopen(), etc.  Continuing on with fhchdir()
would be a natural extension to the established unix API, especially if
ACL semantics were defined for these system calls.  Rewriting open() and
dup() to be library calls sitting atop fhopen() and fhdup() would
probably be possible, though probably not wise.  :-)

>  It somewhat surprises me in hindsight
> that such an approach was not taken in the original Unix
> implementation.

It's not surprising at all if you consider the desire for a hierarchical
filesystem, especially one that supports hard links and rename operations.

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>