tech-kern: Re: Improving the Unix API

Subject: Re: Improving the Unix API
To: der Mouse <mouse@Rodents.Montreal.QC.CA>
From: Gandhi woulda smacked you <greywolf@starwolf.com>
List: tech-kern
Date: 06/27/1999 11:51:20

On Sun, 27 Jun 1999, der Mouse wrote:

# > Robert had to hand-remove the immutable flag
# > (I guess, by accessing the relevant block directly).
# 
# (clri didn't work?)

Obviously the guy thinks along the lines that you need a file descriptor
to do things to files.  That, or he didn't want to do an fsck on the
partition once he was done.

# 
# > Indeed, the "open without access rights"
# > is useful not only to modify attributes and do other ioctl's,
# > but also to effect all operations that should be done w/o the ability
# > to open for either read or write
# > (fstat, funlink, ioctl, fchown, fchmod, fsync),

You mean like stat, unlink, chown, chmod?

Why in the world are you going to fsync a file with which you haven't
done anything?

The only one up there that makes sense is ioctl.

# 
# funlink makes no sense, unless the fd it takes is the fd of a directory
# and you pass in the name of the entry to be removed - which I imagine
# is not what most people will think when they think of an fd-based
# variant of unlink.  unlink() operates on names, not files, after all.

I've wanted an fclri(fd) which would clear the dev/ino attached to the
fd, but there's no clean way to do that as the system would then have
to search for all instances of that ino on that dev, and that's something
the system has no business doing.  namei (name-to-inode) is necessary,
and also easier than doing the reverse since name-to-inode mapping
is unique (because pathnames are unique) while inode-to-name mapping
is not (because there are hard links, i.e. multiple names can refer
to the same inode).  [read that carefully, it looks contradictory but
isn't.]
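
The asymmetry is visible from userland.  A minimal sketch, assuming a
POSIX system and a filesystem that supports hard links; the helper name
link_count_demo is made up for illustration.  Two lookups by name each
give exactly one inode, but going the other way the kernel only hands
back a count of names, not the names themselves:

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create `first`, hard-link it as `second`, and confirm that both names
 * resolve to one inode.  Returns the link count the kernel reports
 * (2 on success) or -1 on error.  Both names are removed before returning.
 */
int link_count_demo(const char *first, const char *second)
{
    struct stat a, b;
    int result = -1;
    int fd = open(first, O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return -1;
    close(fd);

    if (link(first, second) == 0) {
        /* name -> inode: each lookup has exactly one answer */
        if (stat(first, &a) == 0 && stat(second, &b) == 0 &&
            a.st_dev == b.st_dev && a.st_ino == b.st_ino)
            result = (int)a.st_nlink;   /* inode -> names: only a count */
        unlink(second);
    }
    unlink(first);
    return result;
}
```

Nothing in struct stat says where the second name lives, which is why
an fclri(fd) would have to go hunting through the whole filesystem.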

# I've often wanted open-with-no-access in conjunction with fchdir().
# This is because you need only execute access to set your cwd to a
# directory, but there's no way to get an fd on a mode-111 directory.

Playing the Daemon's advocate, here...
What use is a descriptor into a directory you can't read?  What's
the point of fchdir(dd->fd) if you can't figure out where you're going
from there?  You may as well use chdir(dir).
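
For reference, one common use of a directory descriptor is to get back
to where you started after a detour, without trusting a saved pathname.
A minimal sketch, assuming POSIX open() and fchdir(); the helper name
visit_and_return is hypothetical.  Note that the open() needs read
permission on ".", which is exactly what a mode-111 directory refuses:

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * chdir() to `elsewhere`, then return to the starting directory through
 * a descriptor rather than a pathname.  Returns 0 on success, -1 on error.
 */
int visit_and_return(const char *elsewhere)
{
    int rc = -1;
    int home = open(".", O_RDONLY);   /* fails on a mode-111 directory */

    if (home < 0)
        return -1;
    if (chdir(elsewhere) == 0)
        rc = fchdir(home);            /* back to where we started */
    close(home);
    return rc;
}
```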

# While I think there are ways symlinks could be improved, I don't think
# this is one of them.  I can't see any use for opening a symlink except
# use of write() to atomically make the link point somewhere different,
# and I'd prefer to do that by making symlink() do that when the link
# already exists and some appropriate condition is met.

That's a dicey proposition.  We already have quite a few "appropriate
condition" cases, and I think we want to avoid special-casing a whole
slew of conditions.
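
For what it's worth, the atomic repointing can be had without touching
symlink() at all: build the new link under a temporary name and
rename() it over the old one.  A minimal sketch, assuming POSIX
symlink() and rename(); the helper name and the ".new" suffix are
made up for illustration:

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Make `linkpath` point at `target`, replacing any existing symlink in
 * one step.  Returns 0 on success, -1 on error.
 */
int repoint_symlink(const char *target, const char *linkpath)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.new", linkpath);

    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;
    unlink(tmp);                        /* clear a stale temporary, if any */
    if (symlink(target, tmp) != 0)
        return -1;
    if (rename(tmp, linkpath) != 0) {   /* replaces the name atomically */
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

rename() acts on the link itself rather than on what it points to, so
a lookup of linkpath sees either the old target or the new one, never
a missing name.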

# > Of course, you'll want to be able to fcntl(fd,F_SETFL,O_RDWR)
# > or something equivalent, to upgrade your access mode
# > on a file you opened with O_NULL.
# 
# The security weenie in me is _really_ unsure that the ability to
# increase the access modes on an open fd is a good idea.

Nah.  fd's are inevitably associated with vnodes (which don't get freed
until the last close()); if the vnode doesn't map out to the appropriate
permissions, the fcntl() would fail.
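
As fcntl() is specified by POSIX today, the question doesn't arise:
F_SETFL ignores the access-mode bits, so asking for O_RDWR on a
read-only descriptor changes nothing.  A small check, assuming a POSIX
system (the helper name is hypothetical):

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Open `path` read-only, ask F_SETFL for O_RDWR, and report the access
 * mode the descriptor ends up with (O_RDONLY, O_WRONLY or O_RDWR),
 * or -1 on error.
 */
int access_mode_after_setfl(const char *path)
{
    int mode = -1;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fcntl(fd, F_SETFL, O_RDWR) != -1) {
        int fl = fcntl(fd, F_GETFL);

        if (fl != -1)
            mode = fl & O_ACCMODE;    /* keep only the access-mode bits */
    }
    close(fd);
    return mode;
}
```

An upgrade as proposed would therefore need new semantics for F_SETFL
(or a new command), with the permission check done against the vnode
at that point.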

# > About namei() and large directories, Robert suggested
# > that news servers, and other large databases
# > (terminfo, that web cache, and many more come to my mind),
# > should use special database libraries with a well-defined API
# > (possibly inspired by the filesystem interface),
# > rather than abuse the filesystem API as they do;
# 
# At least one news system does this now, I think - instead of keeping
# each post in a separate file, it uses one huge file and does its own
# space allocation out of it.

Another problem with making filesystems for news was that you had to tune
cpg and bpi way down and ipg way up.

When fscking the block device was actually possible, it was also faster
to fsck the block device on a device full of symlinks (but that's a horse
of a different colour, I realise...).  Go figure.

# > Another problem was the ability to change the mount status of a partition
# > from read-write to read-only or to unmounted,
# 
# See NetBSD (and presumably other BSD) "mount -o update,rdonly" and/or
# "umount -f".  (Last I tried, the latter didn't work as it should, but
# that's a matter of fixing bugs rather than introducing new features.)

...really?  umount -f always works for me.

There's a bug running around, though, at the end of a shutdown which prevents
me from umounting /var for some reason (fstat shows nothing).

# > Finally, we discussed about saving _and restoring_ the state of a process,
# > another hack that he did once to preserve a long-winded calculation
# > from the service shutdown of a big unix computer.
# 
# I did this once, long long ago, under (I think) 4.3.  I found that I
# couldn't just dump core, though I forget why.  As for the open file
# descriptor question, I punted - I made the relevant call fail unless
# the process had no fds open.

Yeah, there's just no way to restore fd state from saved state, since
that would require locking that particular set of state in core and
if you're going to do that, you certainly cannot shut the system down.
Your program will need to maintain the state of the fd flags as set by
fcntl() and the disposition of the file descriptors (read,write,both)
by logging them as you do them; the position information is about the
only parameter you will be saving at the time of the suspension.
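
A rough sketch of that bookkeeping, assuming regular (seekable) files
that will still be reachable under the same name after the restart;
the struct and function names are hypothetical.  For brevity it reads
the flags back with F_GETFL at save time instead of logging them as
they are set:

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* What the program has to remember about one descriptor. */
struct fd_record {
    const char *path;    /* logged by the program at open() time */
    int         flags;   /* access mode and status flags */
    off_t       offset;  /* file position at suspension */
};

/* Capture flags and position; the path must already have been logged. */
int fd_record_save(int fd, const char *path, struct fd_record *rec)
{
    rec->path = path;
    rec->flags = fcntl(fd, F_GETFL);
    rec->offset = lseek(fd, 0, SEEK_CUR);
    return (rec->flags == -1 || rec->offset == (off_t)-1) ? -1 : 0;
}

/* Reopen by name and seek back; returns the new descriptor or -1. */
int fd_record_restore(const struct fd_record *rec)
{
    /* never create or truncate when reopening */
    int fd = open(rec->path, rec->flags & ~(O_CREAT | O_EXCL | O_TRUNC));

    if (fd < 0)
        return -1;
    if (lseek(fd, rec->offset, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Pipes, sockets and unlinked files have no name to reopen, so this
covers only the easy case.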

You probably can't just dump core because all the signal overhead
caused by dumping core would be saved in the core file.  You need to
effectively generate a state file from which your calculations
can resume.  Someone wrote an undump utility, though, which seemed
to branch around the fault that caused the core dump, so go figure.

# > 4.4 chflags() works fine for UFS and leaves other filesystems to map
# > what they can into the UFS set.  Which is bogus - immutable is not a
# > UFS attribute, it's VFS one.

Well, no and yes.  ("Go not to the elves for counsel...")  Yes, it's
a VFSATTR.  No, other filesystems do not map into the UFS set.  They map
into the VFS set which is rather a superset of all possible filesystem
attributes.  If the underlying filesystem type supports something in the
VFS, it gets set.  If it doesn't, it gets ignored.  Likewise for something
for which VFS does not account in the underlying filesystem (in which
case it's time to take another look at the VFS structure, but I suspect
we have some room left).

# > As for the opening with no permissions - well, it would make *big*
# > sense if we could narrow down the API and move chown(), chmod(), etc.
# > into libc leaving f-variants in the kernel.
# 
# I really don't like that.  The reasons why are (1) this means you have
# to have an fd free to do them; (2) it triples the number of user/kernel
# crossings involved.

Ditto that.  That is, in principle, the largest part of the mikrokernel
gestalt that really bugs me (the message passing and interpreting seems
harmless enough except for the overhead in contexts that it generates).

Narrowing the API is not a good thing to do for its own sake.  The moves
really have to make sense.  Stuff that does direct operations on system
structures is best left in the kernel and, last I looked, a vnode (fh,
vp, whatever) is a system structure which, while it can be transported
out to userland for perusal (I think...), has no value in being mutated
in userland and passed back in to the kernel for approval on the changes.

# > Extreme variant might include {set,get}sockopt extended to files and
# > doing both *stat and *ch{mod,own,flags} via that.
#
# If done, I think the name should be changed.  They are ?etSOCKopt,
# after all.  I'm not fond of this, though; it amounts to returning to
# using ioctl() for the tasks - albeit with a slightly different name.

{set,get}fopt?

I don't know.  What are you saving besides kernel entry points?  It's
certainly not nicer to program in (functions in userland add call overhead,
and macros don't show up in debuggers, usually).

"{,f}chmod(path, mode);" certainly codes easier than 
"setfopt({path,fd}, SFO_CHMOD|SFO_{PATH,DESC}, mode)".

And "getfopt(path, SFO_NOFOLLOW|SFO_PATH, sp);" vs "lstat(path, sp);"?  Or
"setfopt(fd, SFO_DESC|SFO_CHDIR, dir)" vs "fchdir(fd);"???

No, thank you.

[I hope it's obvious why the SFO_PATH and SFO_DESC would be necessary in
the aforementioned cases.]
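
To spell it out, here is a userland stand-in for such a call, handling
chmod only.  It is purely illustrative: only the SFO_* names come from
the examples above, while the flag values, the union and the dispatch
are assumptions made for the sketch:

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical flag values; only the names come from the discussion. */
#define SFO_PATH   0x01   /* target is a pathname */
#define SFO_DESC   0x02   /* target is a file descriptor */
#define SFO_CHMOD  0x10   /* operation: change mode */

/* One argument slot has to carry either kind of target. */
union sfo_target {
    const char *path;
    int         fd;
};

/* Stand-in for the proposed multiplexed call (chmod only). */
int setfopt(union sfo_target t, int what, mode_t mode)
{
    if (!(what & SFO_CHMOD)) {
        errno = EINVAL;
        return -1;
    }
    switch (what & (SFO_PATH | SFO_DESC)) {
    case SFO_PATH:
        return chmod(t.path, mode);
    case SFO_DESC:
        return fchmod(t.fd, mode);
    default:    /* neither or both: no way to tell what t holds */
        errno = EINVAL;
        return -1;
    }
}
```

Without the SFO_PATH/SFO_DESC bit the call cannot know which member of
the union is live, and every caller pays for that extra flag and the
dispatch just to reach the chmod() or fchmod() it wanted all along.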

				--*greywolf;