tech-kern: Improving the Unix API

Subject: Improving the Unix API
To: Linux Kernel <linux-kernel@vger.rutgers.edu>
From: Francois-Rene Rideau <fare@tunes.org>
List: tech-kern
Date: 06/27/1999 11:45:26
		Improving the Unix Kernels' API
	A Kernel Discussion with Hacker Robert Ehrlich

Summary: after a discussion with R.E., I submit a suggestion about improving
the API of free Unices with useful features such as open(path,O_NULL);


Dear Free *n*x Kernel Hackers,
   I've been talking today with old-time Unix hacker (since V6 or so)
Robert.Ehrlich@inria.fr about possible improvements in the design
of Unix APIs in general, and to the Linux kernel in particular.
I'd like to share a summary with you,
since you are the ones fit to implement them or not.
(Comments in parentheses are purely mine).

The starting point of the discussion was an unexplained corruption
of a Linux ext2 fs partition on a friend's machine.
Our common friend (Bernard.Lang@inria.fr) had found
that 4 subdirectories in his large persistent web cache were corrupted:
their type, size, dates, access rights, attributes, etc, were garbled.
Obviously, a directory inode had been filled with random garbage.
As I happened to pass by, I helped Bernard kill the processes
that had files opened on the mounted drive, so as to fsck it.
The corrupted directories were lost.
Happily, they weren't critical data, but it is still annoying to lose them
(who knows, maybe some of the files are now lost pearls of the Internet?).
After fsck, Bernard tried to remove the files,
but there remained one that garbled meta-data had made into
a non-existing block device, that would resist rm -f.

On Friday morning (I guess, since Robert wasn't there on Thursday),
Bernard asked Robert for help. Robert tried to figure out what went wrong,
and soon ended up examining a binary dump of the bad block
and reading the kernel source code to understand.
He realized that the device had an immutable attribute.
He tried to change the attribute with open() and ioctl()
(having learnt about the immutable flag and its behavior
by reading kernel sources for rm, and grep'ing for the flag
in the rest of the kernel; he didn't know about chattr;
chattr must do the same, anyway).
However, the problem is that to change the attribute,
you have to open the file before you can ioctl() on it;
but the file didn't exist (a non-existing device!)
and thus couldn't be opened successfully.
Robert had to hand-remove the immutable flag
(I guess, by accessing the relevant block directly).
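
For reference, the open()-then-ioctl() sequence described above can be
sketched as follows. This is only a minimal sketch, assuming a Linux
system whose <linux/fs.h> provides the generic FS_IOC_GETFLAGS and
FS_IOC_SETFLAGS requests (ext2 headers of this era used EXT2_IOC_*
spellings instead); the function names are made up for illustration.

```c
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Pure helper: the inode flag word with the immutable bit removed. */
int without_immutable(int flags)
{
    return flags & ~FS_IMMUTABLE_FL;
}

/* Clear the immutable attribute of `path`. Returns 0 on success, -1 on error.
 * Clearing the flag normally needs root (CAP_LINUX_IMMUTABLE). */
int clear_immutable(const char *path)
{
    int flags;
    int fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd < 0)
        return -1;   /* the failure described above: no fd, so nothing to ioctl() on */
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
        close(fd);
        return -1;
    }
    flags = without_immutable(flags);
    int rc = ioctl(fd, FS_IOC_SETFLAGS, &flags);
    close(fd);
    return rc < 0 ? -1 : 0;
}
```

The early return after open() is exactly where the garbled device node
got stuck: the attribute change is only reachable through a descriptor.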

We met afterwards, before lunch (he did all that during that morning at work,
including diagnosis and correction of the problem by reading the kernel code;
and he didn't know about the existence of lsattr and chattr -- impressive!).

Robert told me that in some Unix flavors of old,
it was possible to open a file by path with a null access mode (O_NULL ?)
granting neither read nor write access,
of value -1 (bytewise? or 2-bit-wise?),
so that adding 1 to the open mode you get 0 for O_NULL, 1 for O_RDONLY,
2 for O_WRONLY, 3 for O_RDWR, and you get a 2-bit capability bitmask
for read and write. He argued that it would have been useful
to be able to do that in modern Unices.
An alternative would be to provide additional system calls
to change attributes, as well as for everything that should
be done on files without having to open them.
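
The "add 1 to the open mode" arithmetic can be checked with a few lines
of C. This is a sketch assuming the traditional values O_RDONLY=0,
O_WRONLY=1, O_RDWR=2 (true on Linux and the BSDs, though not mandated
by POSIX); O_NULL_SKETCH and access_bits() are hypothetical names.

```c
#include <fcntl.h>

/* Hypothetical: the access mode that is -1 when read as a 2-bit value. */
#define O_NULL_SKETCH 3

enum { CAN_READ = 1, CAN_WRITE = 2 };

/* Map the access-mode part of open() flags to a read/write bitmask:
 * bit 0 set means read access, bit 1 set means write access. */
int access_bits(int open_flags)
{
    return ((open_flags & O_ACCMODE) + 1) & 3;
}
```

With these values, access_bits(O_NULL_SKETCH) is 0, that is, neither
read nor write.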

Indeed, the "open without access rights"
is useful not only to modify attributes and do other ioctl's,
but also to effect all operations that should be done w/o the ability
to open for either read or write
(fstat, funlink, ioctl, fchown, fchmod, fsync),
and could be used with new syscalls like
flink (make a new directory link for file given by descriptor),
freadlink (read link from a file descriptor opened with O_NULL),
fexec (execute the binary that we checked), etc.
open(path,O_NULL) allows you to do all these things _atomically_,
without all those nasty race conditions that happen all the time
in absence of it, when you have to check a file,
then use the data from whichever same-named file happens to be there
between two system calls, without any kernel-enforced way
to ensure the file will be the same at that time.
Of course, you'll want to be able to fcntl(fd,F_SETFL,O_RDWR)
or something equivalent, to upgrade your access mode
on a file you opened with O_NULL.
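
To make the pattern concrete: the closest thing I know of is the
Linux-only O_PATH flag, a later addition that was not available when
this was written. The sketch below assumes such a kernel, plus a mounted
/proc for the "upgrade" step; the function names are hypothetical, and
reopening through /proc/self/fd is a workaround, not the
fcntl(fd,F_SETFL,...) upgrade suggested above.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open `path` with neither read nor write access, then check on the
 * descriptor itself that it is a regular file. Returns the fd or -1.
 * Later f*() calls on this fd act on the inode that was checked, whatever
 * happens to the name in the meantime. */
int open_null_regular(const char *path)
{
    struct stat st;
    int fd = open(path, O_PATH | O_NOFOLLOW | O_CLOEXEC);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || !S_ISREG(st.st_mode)) {
        close(fd);
        return -1;
    }
    return fd;
}

/* "Upgrade" to read access by reopening the same object through procfs.
 * Normal permission checks apply at this point. Returns a new fd or -1. */
int upgrade_to_read(int null_fd)
{
    char link[64];
    snprintf(link, sizeof link, "/proc/self/fd/%d", null_fd);
    return open(link, O_RDONLY | O_CLOEXEC);
}
```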

It looked like the Linux kernel did immutability checking in the wrong places:
not only can't you modify the attributes of a file you cannot open,
but you cannot do it for symlinks, either (actually,
the situation of symlinks with respect to attributes and fstat, etc,
is very peculiar; maybe there should be in open an O_DONTFOLLOWLINK option
when you open in mode O_NULL, so that you can do the equivalent
of lstat on a file descriptor; again, such a thing could be a life-saver
when dealing with files atomically in presence of symlinks).
I remember having been very disappointed not being able to chattr +i
symlinks from /etc to /trans/etc or /proc so as to ensure
that given "files" would always point to the zone where I store
machine/network-dependent configuration files
that I generate automatically from templates when I move
from one machine to another, or when I plug my laptop into another network.
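
The "lstat on a file descriptor" and freadlink ideas can be sketched on
Linux kernels that have O_PATH (again a later addition, so an assumption
here): opening with O_PATH | O_NOFOLLOW yields a descriptor for the
symlink itself, and readlinkat() with an empty pathname reads its
target. open_link_itself() and freadlink_sketch() are hypothetical
names.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open the symlink itself rather than what it points to. */
int open_link_itself(const char *path)
{
    return open(path, O_PATH | O_NOFOLLOW | O_CLOEXEC);
}

/* Read the target of a symlink opened with open_link_itself().
 * NUL-terminates `buf`; returns the target length or -1. */
ssize_t freadlink_sketch(int link_fd, char *buf, size_t size)
{
    if (size == 0)
        return -1;
    ssize_t n = readlinkat(link_fd, "", buf, size - 1);
    if (n >= 0)
        buf[n] = '\0';
    return n;
}
```

fstat() on such a descriptor reports the link (S_ISLNK), which is the
descriptor-based lstat asked for above. It does not make chattr +i work
on symlinks.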

The discussion thus evolved into issues of lookup tables in kernel space.
Robert had a case long ago when he optimized tar on a Unix system of old
to use one of those "cheap" tape drives over which you had little control.
The tape had two speeds, something like 25ft/s and 100ft/s
(I don't remember for sure what speed he said),
but tar didn't work well at 100ft/s.
Indeed, at 100ft/s, the tape drive wouldn't stop right away,
and had to rewind back to where it stopped
(which was done transparently to the user).
So you had to keep your buffers full to avoid
the slowness and unreliability of rewinding too often.
After optimizing tar quite a bit, Robert faced the problem
of the performance bottleneck being open(),
due to the kernel doing lots of namei lookups
(all the more with poor caching, at the time),
all the more in absence of fchdir()
(which used to make open() even more of a problem in cp and mv).
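
To illustrate the lookup cost: resolving the directory once and then
opening members relative to it spares the kernel from walking the whole
path for every file. The sketch below assumes a system with the
POSIX.1-2008 openat() call (fchdir() on the directory descriptor gives a
similar effect); the helper names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Resolve the directory path once; returns a directory descriptor or -1. */
int open_dir_once(const char *dir_path)
{
    return open(dir_path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/* Open `name` relative to an already-resolved directory: only `name`
 * is looked up, not the full path leading to the directory. */
int open_member(int dir_fd, const char *name)
{
    return openat(dir_fd, name, O_RDONLY | O_CLOEXEC);
}
```

A tar-like loop would call open_dir_once() per directory and
open_member() per entry.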

As a feature related to open(path,O_NULL) and avoiding race conditions,
I remarked about directory backup that another useful kernel feature
would be locking on opened directories,
so as to do atomic modification/backup/foo of directories
with kernel support rather than using conventional advisory lock files.
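
The nearest existing approximation I can sketch is flock() on a
directory descriptor. It is kernel-managed and needs no lock file, but
it is still advisory: only processes that also take the lock are kept
out, so it is weaker than the locking wished for above. The sketch
assumes a system with BSD-style flock(); the function names are
hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Take an exclusive, non-blocking lock on a directory. Returns a descriptor
 * holding the lock, or -1 if the directory is missing or already locked. */
int lock_directory(const char *dir_path)
{
    int fd = open(dir_path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Release the lock and the descriptor that held it. */
void unlock_directory(int lock_fd)
{
    flock(lock_fd, LOCK_UN);
    close(lock_fd);
}
```

The lock vanishes when the holder exits or crashes, which a stale lock
file does not.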

About namei() and large directories, Robert suggested
that news servers, and other large databases
(terminfo, that web cache, and many more come to my mind),
should use special database libraries with a well-defined API
(possibly inspired by the filesystem interface),
rather than abuse the filesystem API as they do;
he didn't think that attempts to adapt filesystem code
specifically to every such use would be useful
(plus it's doomed to induce the hell of putting more and more complicated
filesystem code in the kernel for every single application).

Another problem was the ability to change the mount status of a partition
from read-write to read-only or to unmounted,
which is sometimes done automatically and unnicely
in presence of severe corruption,
and that should be doable by root in a nice way
even in absence of severe fs corruption
(small corruption; removable media; service operation; whatever).
The first question was: what to do about processes that had
open file descriptors on the partition with too much access?
Would they get spurious errors?
Would they be stopped, or blocked, or killed, either
immediately or at first file access with obsolete rights on the partition?
Robert favored the stop-at-once option.
In any case, handling such things at fd use requires marking fds somehow,
so that operations will not go awry while the partition is not available,
and will resume nicely when it is again available with enough access;
while handling things at once requires that the kernel atomically
(or otherwise consistently) determine which processes are
using the partition so as to kill them,
which in the current implementation requires a linear scan
over the whole process table to see if they use a fd on said partition.
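
The linear search has a user-space analogue, which is roughly what tools
like fuser do, and which shows why the cost grows with the number of
processes and descriptors. This is only a sketch, assuming a Linux-style
/proc with per-process fd directories; it is not atomic, and the
function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Return 1 if process `pid` holds a descriptor on device `dev`, 0 if not,
 * -1 if its fd directory could not be read (e.g. no permission). */
int pid_uses_device(const char *pid, dev_t dev)
{
    char path[1024];
    snprintf(path, sizeof path, "/proc/%s/fd", pid);
    DIR *fds = opendir(path);
    if (!fds)
        return -1;
    int found = 0;
    struct dirent *e;
    while (!found && (e = readdir(fds)) != NULL) {
        struct stat st;
        if (e->d_name[0] == '.')
            continue;
        snprintf(path, sizeof path, "/proc/%s/fd/%s", pid, e->d_name);
        /* stat() follows the link to the object the descriptor refers to */
        if (stat(path, &st) == 0 && st.st_dev == dev)
            found = 1;
    }
    closedir(fds);
    return found;
}

/* Linear scan over every process: count those with a descriptor on `dev`. */
int count_users_of_device(dev_t dev)
{
    DIR *proc = opendir("/proc");
    if (!proc)
        return -1;
    int count = 0;
    struct dirent *e;
    while ((e = readdir(proc)) != NULL) {
        if (!isdigit((unsigned char)e->d_name[0]))
            continue;   /* only numeric entries are processes */
        if (pid_uses_device(e->d_name, dev) == 1)
            count++;
    }
    closedir(proc);
    return count;
}
```

Processes can open or close files while the scan runs, which is
precisely the consistency problem raised above.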

The latter possibility raised a second question:
a linear scan of the whole process table
may be quite long for an atomic operation,
so the question is whether it should be allowed to begin with
(the original KISS design of the Unix kernel would answer: "no!"),
and if so, is the kernel interruptible by "real-time" processes
while doing such operations; indeed, in a truly real-time OS,
response time is most important and shouldn't be degraded
by long system operations.
This is an essential question for kernel design.
One "solution" proposed in the past has been micro-kernel design
(I explain in http://www.tunes.org/papers/Glossary/index.html#microkernel
why micro-kernel isn't really the solution to anything).
The real solution is to have a mostly-interruptible kernel
with as fine-grained locking as needed,
which is useful anyway in a multi-processor implementation.
(Real-time support also implies adapted scheduling and priority management,
but this is an otherwise well-known topic well studied in computer science).

Finally, we discussed saving _and restoring_ the state of a process,
another hack that he did once to preserve a long-winded calculation
from the service shutdown of a big unix computer.
Saving was a "simple" matter of dumping core.
The only right way to restore things atomically on computers
without user-accessible return-from-interrupt was to use sigreturn,
simulating the correct argument stack from saved state.
Of course, even if you manage such a hack, there remains the problem
of saving the underlying system state of the process:
file descriptors, sharing of resources, etc.
There is no way to consistently restore these without system support
(so that people who want to do it have to go through specialized
infrastructure libraries that keep enough information to (attempt to) restore
process connections at restart --
some people @lip6.fr have done process migration this way);
it would be nice if, when the system dumps core,
it also saved enough description about file descriptors
so that restoring connections may be attempted by a core handler
(maybe the system could even dump cores in the format of binaries
dynamically linked to a core-restoring library).
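
As an idea of what "enough description" could mean for the simplest
case, here is a user-space sketch that records a regular file's name,
open flags and offset, and later tries to reopen it. It assumes a
Linux-style /proc/self/fd; struct fd_record and both functions are
hypothetical, and pipes, sockets or renamed files are not handled.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record of what a core handler needs to reopen a file. */
struct fd_record {
    char  path[PATH_MAX];  /* name the descriptor referred to when saved */
    int   flags;           /* access mode and status flags from F_GETFL */
    off_t offset;          /* current file position */
};

/* Fill `rec` from an open descriptor. Returns 0 on success, -1 on error. */
int save_fd(int fd, struct fd_record *rec)
{
    char link[64];
    snprintf(link, sizeof link, "/proc/self/fd/%d", fd);
    ssize_t n = readlink(link, rec->path, sizeof rec->path - 1);
    if (n < 0)
        return -1;
    rec->path[n] = '\0';
    rec->flags = fcntl(fd, F_GETFL);
    rec->offset = lseek(fd, 0, SEEK_CUR);   /* fails for pipes and sockets */
    return (rec->flags < 0 || rec->offset < 0) ? -1 : 0;
}

/* Try to rebuild the descriptor. Returns a new fd or -1. */
int restore_fd(const struct fd_record *rec)
{
    /* Never create or truncate: the file must still exist as it was. */
    int fd = open(rec->path, rec->flags & ~(O_CREAT | O_EXCL | O_TRUNC));
    if (fd < 0)
        return -1;
    if (lseek(fd, rec->offset, SEEK_SET) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Nothing here verifies that the reopened file is the same one, so a real
core handler would still need the consistency checks that only system
support can make reliable.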

We then discussed saving and restoring more than a single process:
saving the whole state of the machine so as to restore it later
(which some laptops do in a combination of hardware and firmware).
Not only must core memory be saved (as a useful optimization,
buffers should be flushed, and may or may not be dropped from the core dump),
but as a problem similar to the one above,
the underlying system state (i.e. I/O device state) must be saved, too,
which requires save&restore support from every device driver
(by saving a mirror device state in software
when it isn't readable from hardware),
as well as tracking dependencies between device initializations
so as to do things correctly.
Of course, as with processes above, the restoring code
should do consistency checking to ensure that restored state
is coherent with disk state, device state, etc;
in case some inconsistency is detected,
handlers should be called to resolve it by modifying core state;
and in case the inconsistency cannot be resolved,
the state restoration should fail.
As a side effect to a successful save&restore feature,
not only would it be possible to do on every computer in software
what is done only on some laptops in hardware
(and not always satisfactorily, depending on laptops),
it would also be possible to have a "fast start mode",
whereby you'd save the machine in a state ready-to-go,
so as to achieve the fastest-booting OS in the west.
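
The driver-side contract described in this paragraph can be modelled in
a few lines. This is a toy sketch, not code from any kernel: struct
device, its fields and the functions are all hypothetical, and array
order stands in for initialization dependencies.

```c
#include <stddef.h>

/* Hypothetical per-device record; array order is initialization order. */
struct device {
    const char *name;
    int shadow;   /* software mirror of state that can't be read from hardware */
    int saved;    /* copy of the mirror stored in the machine image */
    int probed;   /* what the hardware reports at restore time (e.g. a media id) */
    int (*resolve)(struct device *);  /* repair a mismatch; 0 on success, may be NULL */
};

/* Save in reverse initialization order, so dependents go first. */
void save_all(struct device *devs, size_t n)
{
    for (size_t i = n; i-- > 0; )
        devs[i].saved = devs[i].shadow;
}

/* Restore in initialization order. An inconsistency is handed to the
 * device's handler; if it cannot be resolved, the whole restore fails. */
int restore_all(struct device *devs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct device *d = &devs[i];
        d->shadow = d->saved;   /* stands in for reprogramming the hardware */
        if (d->shadow != d->probed) {
            if (!d->resolve || d->resolve(d) != 0)
                return -1;
        }
    }
    return 0;
}

/* Example handler: accept whatever the hardware reports now. */
int adopt_probed(struct device *d)
{
    d->shadow = d->probed;
    return 0;
}
```

A device without a handler makes restore_all() fail on any mismatch,
which matches the rule that an unresolved inconsistency must abort the
restoration.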

We discussed a lot of topics, but the above is all I remember
that is relevant to the unixish kernels.


I am sorry to bug you with such a long e-mail.
I have not enough time, interest or knowledge
to modify the Linux kernel myself for all these features
(all the less time, interest or knowledge
to modify every existing free unix kernel);
actually I'm involved in implementation efforts for another
completely different OS already (see Tunes in my .sig).
Nevertheless, I see immediate benefit for everyone
in the kernel features I proposed.
I hope that some hackers reading me see benefit in them, too,
and choose to implement some of them in their favorite Unix kernel.
Maybe someone could convince Robert to hack Linux or another kernel.

By posting on all the free unix kernel mailing lists I know,
I intend to put free unices in competition as to which
will implement these features first.
I hope this won't launch a cross-mailing-list flamewar.
Please watch what you reply to what recipients.
I am not subscribed to any of the unix kernel lists I'm posting to,
but I'm interested in discussion that this mail may inspire.
I hope that Linux/BSD kernel hackers will implement
the suggested improvements. If not, I'm interested to hear
about reasons why you kernel hackers think it shouldn't be done,
or about the features having already been implemented
in some free Unix flavor.

NB: This mail is my own initiative. I didn't ask Robert E. for consent.
I'm doing my best effort at representing what happened
and what resulted from the discussion with Robert Ehrlich.
Any mistake, any wrong guess, any misrepresentation, any gross error,
any plain stupidity, is purely my own fault. All good ideas are Robert's.

Regards,

[ "Faré" | VN: Đặng-Vũ Bân | Join the TUNES project!   http://www.tunes.org/  ]
[ FR: François-René Rideau | TUNES is a Useful, Nevertheless Expedient System ]
[ Reflection&Cybernethics  | Project for  a Free Reflective  Computing System ]
Those who do not understand LISP are condemned to reinvent it, poorly.
	-- Faré, without apologies to Henry Spencer.