tech-kern: Re: But why?

Subject: Re: But why?
To: None <travis@EvTech.com>
From: David S. Miller <davem@caip.rutgers.edu>
List: tech-kern
Date: 10/22/1996 13:15:58
   Date: Mon, 21 Oct 1996 19:40:31 -0500
   From: Travis Hassloch x231 <travis@EvTech.com>

   >> bacon@mtu.edu (Jeff Bacon) writes:
   >> 3) Every BSD and SVR4 based system today, except for Linux, has a very
   >>    broken System call mechanism.

   I disagree with "broken", except if you generalize it to "inefficient".
   Your fanaticism is showing :)

I'm not the only one who openly calls it "broken".  Larry McVoy and
several engineers at SGI (and Sun, both past and present) have heard
my complete argument and standpoint.  All of them agree with me that
the BSD/AT&T UNIX way of doing system calls is indeed "broken" just as
I have stated, and that the Linux method is clearly the superior one
that exists today.

   >>    You'd think that when people put together function call conventions
   >>    for a particular processor, the OS people would take a look at this
   >>    and find a way to take advantage of this.  In fact, believe it or
   >>    not, they have not to this very day.

   Actually, OS people have done this A LOT.  Just look at L3.
   Microkernel dudes have put out plethoras of papers on reducing system
   call overhead, particularly since their system calls require two
   kernel-boundary crossings.
   Critical code paths are not unrepresented in the literature.

Then why are the BSD camps and the SVR4 based unices still doing it
the original broken way if the literature makes it so obvious that
this is not the way to do it as you have just mentioned?

   >>     whether you are doing it in the traditional broken UNIX way or the
   >>     clean, fast, and superior Linux way.  First I will show the Linux

   Again, this sort of emotional arguing isn't likely to win Linux any converts.
   The terms "clean" and "superior" aren't supported by your arguments.

Then I will support such claims now.

   It is definitely faster.  However, the thing you haven't answered here
   is "what do you lose by doing it this way?".  The answer is "portability".
   You now have to write the backend of every system call in assembler
   (but see below!).

We don't write any of the backends in assembler.  They are plain old C
functions, plus one single assembler entry point (for all system calls)
that moves the arguments into place and vectors off of the master system
call function pointer table (about 13 or 14 machine instructions in all).
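
A minimal sketch of that shape in portable C, not the actual Linux
source.  Every name here is hypothetical, the two "system calls" are
made up, and the entry point that would really be a few assembler
instructions is modelled as a C function.  All handlers share one
six-argument signature only so the table can be called uniformly in
standard C.

```c
#include <stddef.h>

/* Hypothetical common signature so the table can be called uniformly. */
typedef long (*syscall_fn)(long, long, long, long, long, long);

/* Two made-up "system calls", written as ordinary C functions. */
static long sys_add_sketch(long a, long b, long c, long d, long e, long f)
{
	(void)c; (void)d; (void)e; (void)f;
	return a + b;
}

static long sys_negate_sketch(long a, long b, long c, long d, long e, long f)
{
	(void)b; (void)c; (void)d; (void)e; (void)f;
	return -a;
}

#define ENOSYS_SKETCH (-38L)	/* illustrative "no such call" value */

/* The master table: the system call number is just an index into it. */
static const syscall_fn syscall_table_sketch[] = {
	sys_add_sketch,
	sys_negate_sketch,
};

/* C stand-in for the assembler entry point: bounds-check, then vector. */
long syscall_entry_sketch(size_t nr, const long args[6])
{
	size_t count = sizeof syscall_table_sketch / sizeof syscall_table_sketch[0];

	if (nr >= count)
		return ENOSYS_SKETCH;
	return syscall_table_sketch[nr](args[0], args[1], args[2],
					args[3], args[4], args[5]);
}
```

Calling syscall_entry_sketch(0, args) with args = {2, 3, 0, 0, 0, 0}
lands directly in sys_add_sketch; nothing sits between the table
lookup and the handler.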

You lose zero portability.  You, sir, have not even glanced at our
implementation and how it really works before making your statements.
95% of all our system calls are just normal C functions which _all_
ports call directly and share.  The exceptions are:

1) Fork()
2) Clone()
3) Perhaps a few system call compatibility calls which have extremely
   strange semantics on other OS's (and they do funny stuff in locore
   to implement these calls as well).

As I showed in my original posting the implementation is the same size
as the code the BSD/SVR4 derived systems use to get into the "do a
system call" C code.  And in the amount of time they get to this C
code, I am already into the system call (again, in C, and using the
same code every other port uses) doing real work.  Take a look at the
code, you will see how it all works with zero portability problems.

Secondly, as for any other portability problems, the only argument
that could possibly remain is "you cannot possibly be doing
restartable system calls correctly".  And this is a fallacy: we handle
them just fine and in a clean fashion, even with the direct procedure
call implementation of system calls we use.  In fact, on the sparc I
put _nothing_ on the stack at trap time to perform restartable system
calls correctly.

Larry McVoy has even stated the above publicly.

   >>     basically the same, but step 2 is disgustingly inefficient for

   Heheh, agreed.
   But as far as I remember, NetBSD doesn't do any complicated unpacking,
   but simply writes the args onto a contiguous memory area, but I could
   be wrong -- perhaps there is an extra copy in there.

That alone amounts to 2 or 3 cache misses (maybe more since you need
one extra stack frame to call the C code to set this abortion up on
the stack).  Follow your path to an arbitrary system call from trap
time to when you get to the real C handler.  Check out what your
overhead is.  I know what mine is down to the instruction, stack space
used, and average cache misses in both the best and worst case.

I've talked to hardware/OS hacks at SGI about this, and they have told
me that these sorts of cache misses come up like a harrier jet on the
scope in their labs.  It's a fact of life, every single cache miss
counts and must be minimized.

   >> 4) Solaris cannot even do it's own optimizations correctly because
   >>    SunPRO is a broken compiler.
		     ^^^^^^
   Again, I'd say it compiles code just fine.  "Broken" simply isn't
   supported by your argument.  It doesn't have a feature gcc has.
   That's all.

It is broken because it does not allow the programmer to express very
natural things to it the way GCC does.  Someone sent me a mail
saying SunPRO allowed "assembler inlining" within C.  I looked up the
interface they provide, and it is a complete abortion compared to the
GCC one, which is clean and architecture independent.

SunPRO may produce better code now and then (Cygnus people will tell
you that gcc's code generation on the Sparc is comparable in nearly
all cases, though), but GCC has the features and the interfaces for
those features that make it hard to work without it.

How many different processors can the backend of SunPRO's compiler
support from the same sources?  Last time I checked GCC supported
about 24 processor variants.

   >>    gook which has to be written in raw assembly) code can directly
   >>    take advantage of this.  However, the C code cannot do this
   >>    because SunPRO lacks a way for you to tell the compiler that
   >>    "hey you don't need to load things, it's already in these
   >>     hard coded registers"

   Gee, I'm shooting in the dark here, but is it maybe because it's not part
   of the C standard?

Yes, but the C standard allows extensions as long as the compiler
allows one to specify that these extensions are all turned off and
only ANSI C features are allowed in the input code.

Furthermore, your argument does not hold, because as I explained above
SunPRO has an (albeit brain damaged) way to do assembler inlining
within C code, and this certainly is not in the C standard as you just
mentioned.  Does SunPRO have an option that causes such features to be
disallowed in the input sources and thus get strict ANSI C
conformance? (I haven't checked, this is why I ask)

   Although I realize you are enthusiastic (and should be, especially if
   you are trying to "convert" people over), beware of marginalizing your
   "competitors" or their products.  You're speaking to a wide range of
   audiences, and calling something broken when it isn't is not a good idea.
   We can all find a feature Linux (or one of its flavors, anyway) doesn't
   have, can't we?

Ummm....  Then let's put what I have accomplished into proper
perspective, ok?  How many years did Bell Labs and whoever else work on
the SVR4 source base Sun used for putting together Solaris?  Now add
in how many years they worked on that source base to get what they
have now (should be approaching 4 years now).  Now add in all the time
spent on hardware drivers and sparc specific code during the SunOS era
which got grafted or partially rewritten to work under Solaris.  Now
multiply all of that by how many engineers at once were working on it
at a given time.  Sounds like a lot of development, and a long time
spent on it.

Now in the course of say 2 years at this point, I've been able to
outdo them.  And I've done it with only a handful of key engineers.  How
many engineers worked on SunOS/Solaris combined?

This says something about the corporate development model: it sucks
and it does not work efficiently.  This is why the free software model
is going at such a rapid pace and will indeed take over
eventually.  (Mark my words, and there are many people, some of whom
are in major corporations now, who definitely agree wholeheartedly
with me on this.)

   [of course, maybe it _is_ faster than not doing it at all..
   have you measured it?]

I was informed (see the SunPRO inlining "feature" above) that in some
cases they can in fact directly use the optimization I mentioned even
within C code, but judging by how inflexible the interface is, a large
portion of the time they need to use the C stub.

   >>    Now GCC has a way to fully take advantage of such an optimization,
   >>    basically all I have to do is put the following in a header file.
   >>         register struct task_struct *current asm("g6");

   Very, VERY cool!   When was this added to gcc?

It's been there since 3 months after Stallman's first gcc release many
moons ago.  (see gcc.info, "C extensions" --> "Explicit Reg Vars" -->
"Global Reg Vars")  I can get the exact date the feature was available
from the master gcc ChangeLogs.

   Of course, you realize that this is necessarily machine-dependent.
   It is also limited to globals or autos, from what it appears.
   And if you use it on automatic variables, I would guess, you couldn't
   put it in a header file.  And it has to be a register gcc won't trounce.

If gcc sees that asm directive in one of the declarations, it will not
trounce that register.  The Sparc (actually any register windowed cpu
type) is the only funny case, and the solution is clean because globals
are visible in any register window, and what's more, the Sparc ABI states
that %g6 is not to be used by a Sparc ABI compliant compiler.  The
GCC info documentation for this feature explains what are good
registers to use on many architectures for this purpose.

   I was mulling over how one could get the effect of passing-by-register
   in C, and do so while eliminating or isolating any MD portions into a
   small portion of the code.  This combines the
    1) portability advantage of stack-based passing like BSD,
    with the
    2) speed of register passing like Linux.

See above, the Linux method is portable.  You are wasting your time.

   We have three pieces of data:
   1) the MD locore stuff which is written in assembler
   2) the MI system call stuff written in C
   3) the MD mapping so that (2) can get the data written into registers by (1).

Yuck, yes I know how this works, and it is gross.  You need only #1 as
long as the processor ABI specifies where arguments to C functions are
placed in registers and/or the stack.  All of the ones I know do this,
so only the locore code needs to set this up properly, no disgusting
magic asm crap in all of your system calls.  Just use the arguments to
the function call, they are already there for Christ's sake!

Maybe I can convince you more.  I will show you the Linux
implementation on two architectures.

Sparc:

	mov	%i0, %l6	! save for restartable syscalls
	mov	%i0, %o0
	mov	%i1, %o1
	mov	%i2, %o2
	mov	%i3, %o3
	mov	%i4, %o4
	jmpl	syscall_table + %offset_in_local_register, %o7
	mov	%i5, %o5	! executes in the delay slot

That is it, and it works in all cases, including signal based syscall
restarts.
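
As a toy model of why that first "save" instruction is enough, here is
a sketch in plain C.  It is not the Linux code: the register frame,
the restart return code, and the handler are all hypothetical.  It
only shows that when the return value overwrites the first argument
register, keeping one copy of that argument lets the call be re-issued
unchanged.

```c
#include <stdbool.h>

#define ERESTART_SKETCH (-512L)	/* hypothetical "restart me" return code */

/* Hypothetical register frame, loosely named after the Sparc registers. */
struct regs_sketch {
	long i0;	/* first argument on entry, return value on exit */
	long i1;	/* second argument */
	long l6;	/* saved copy of the original first argument */
};

typedef long (*handler_fn)(long, long);

static int attempts;	/* instrumentation: how many times the handler ran */

/* Made-up handler: "interrupted" on the first try, succeeds on the second. */
static long sys_flaky_sum_sketch(long a, long b)
{
	if (++attempts == 1)
		return ERESTART_SKETCH;
	return a + b;
}

/* Models the entry sequence: save arg 0, call, store the result in i0. */
static void enter_syscall_sketch(struct regs_sketch *r, handler_fn fn)
{
	r->l6 = r->i0;			/* mov %i0, %l6 */
	r->i0 = fn(r->i0, r->i1);	/* result clobbers the first argument */
}

/* Models the restart path: put the saved argument back if asked to. */
static bool prepare_restart_sketch(struct regs_sketch *r)
{
	if (r->i0 != ERESTART_SKETCH)
		return false;
	r->i0 = r->l6;
	return true;
}

long run_syscall_sketch(long a, long b)
{
	struct regs_sketch r = { a, b, 0 };

	attempts = 0;
	do {
		enter_syscall_sketch(&r, sys_flaky_sum_sketch);
	} while (prepare_restart_sketch(&r));
	return r.i0;
}
```

The sketch assumes a handler never legitimately returns the restart
code as a result; a real kernel has to reserve such values.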

For the MIPS, the only difference is that a system call can take more
arguments than fit in registers (the MIPS has only 4 outgoing
procedure call argument registers, though later versions of the ABI
(mips4, I believe) allow for a full set of 8), so you have to extend
the syscall_table to hold not only a pointer to the system call, but
also an integer holding the number of arguments the call needs.  Thus:

MIPS:
	/* No need to move the $a? registers, they are already in
	 * the right place.
	 */
	sub	$nargs, $nargs, 4
	blez	$nargs, do_it		/* 4 or fewer args: all in registers */
	 nop

	/* Move remaining args from user stack to kernel stack here. */

do_it:
	jalr	$syscall_addr
	 nop				/* branch delay slot */

This works also on the ALPHA, and the Intel (obviously), the m68k, and
just about every other architecture that Linux has been ported to.
All Linux system calls look like:

asmlinkage int sys_setregid(gid_t rgid, gid_t egid)
{
}

C code, nothing more, and the implementation is very minimal on all
architectures and is clean and is portable, and above all efficient as
hell.
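
The MIPS variant above, where each table slot carries an argument
count, can be modelled in C roughly as follows.  This is a sketch
under stated assumptions, not the Linux source: the slot layout, the
handler names, and the 8 argument cap are hypothetical, and a real
kernel would have to validate the user stack before reading from it.

```c
#include <stddef.h>

#define REG_ARGS 4	/* arguments that arrive in $a0..$a3 */
#define MAX_ARGS 8	/* arbitrary cap chosen for this sketch */

typedef long (*handler8_fn)(long, long, long, long, long, long, long, long);

/* Hypothetical table slot: the function plus how many args it needs. */
struct syscall_slot {
	handler8_fn fn;
	int nargs;
};

static int words_copied;	/* instrumentation: stack words moved */

static long sys_sub_sketch(long a, long b, long c, long d,
			   long e, long f, long g, long h)
{
	(void)c; (void)d; (void)e; (void)f; (void)g; (void)h;
	return a - b;
}

static long sys_sum6_sketch(long a, long b, long c, long d,
			    long e, long f, long g, long h)
{
	(void)g; (void)h;
	return a + b + c + d + e + f;
}

static const struct syscall_slot slots_sketch[] = {
	{ sys_sub_sketch, 2 },
	{ sys_sum6_sketch, 6 },
};

long mips_dispatch_sketch(const struct syscall_slot *s,
			  const long a_regs[REG_ARGS],
			  const long *user_stack)
{
	long extra[MAX_ARGS - REG_ARGS] = { 0 };
	int remaining = s->nargs - REG_ARGS;	/* sub $nargs, $nargs, 4 */
	int i;

	words_copied = 0;
	if (remaining > MAX_ARGS - REG_ARGS)
		remaining = MAX_ARGS - REG_ARGS;
	/* Skipped entirely when everything fits in registers. */
	for (i = 0; i < remaining; i++) {
		extra[i] = user_stack[i];
		words_copied++;
	}
	return s->fn(a_regs[0], a_regs[1], a_regs[2], a_regs[3],
		     extra[0], extra[1], extra[2], extra[3]);
}
```

A two argument call never touches the user stack here; only the six
argument call copies its two extra words.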

   Write all system calls with no parameters, and use C globals (arg1..argN)
   pass in the data, with their declarations and asm() stuff above isolated
   into a MD header file.

   [ ... ]

Would you please look at the Linux implementation before inventing a
new wheel which doesn't even work as well and has many of the same
problems as what exists in BSD/SVR4 today?  Your implementation in idea
1 is much less portable than anything I have ever seen.

   Idea 2

   [ ... ]

Again, not all that great of an idea.

Listen to me, I will say it one more time:

	You can only rely on the ABI procedure call conventions for
	any given processor.  Thankfully all of the processor ABIs
	specify things in such a way that you can set up the arguments
	from locore into the appropriate registers/stack-slots so that
	just plain C code works and is portable and allows all the
	features UNIX has today including syscall restarts.

QED.

   I'm not familiar with gcc, so I'm not confident at all that this is
   practical.  I'm guessing potential pitfalls are:
   1) the linker has to be too smart
   2) the compiler has to leave the procedure entry and exit points in
   such a generic state that certain optimizations cannot be performed
   3) you lose the distinction between compiler and linker -- you start
   wondering, "gee, can I do some interprocedural dataflow analysis in
   here" and pretty soon you are requiring every part of the code
   to be resident in order to do those optimizations

We use the standard linkers and compiler output to do system calls the
way we do.  No special magic at all; all of it is very straightforward
and clean and portable.  Please, once again, I urge you to
actually look at how Linux implements things before shooting off your
mouth about what is so bad or wrong about it.  I have looked at the
BSD4.X method, and various SVR4 based implementations very closely,
and know exactly what is going on with both the Linux and the classic
UNIX implementations.  You speak as if you have not done your
homework, please do so before arguing against the Linux ways or
trying to dream up better ways to perform system calls in a portable
and efficient manner.

   >> I hope that explains some of it, and gives people at least some sort
   >> of idea of the kinds of things that makes Linux scream on just about
   >> any hardware.

   Remember folks, maximum velocity is very important, but it ain't the whole
   story.  As one of my friends vitriolized; "sure, if you're one of those
   classless morons who only cares about top speed".  Rude, true, and
   unforgettable.

Portability, cleanliness, and rock solidness matter to me as well.  I
have crashme, multiuser, and high end server uptimes that match the
best of them.  Speed is not the only thing that matters to me.

   I am eagerly awaiting constructive responses.  I don't care about OS
   religious wars, but I'd love to do something to improve the situation for
   one or for all.

I am too, but only if people are going to do their homework and not
approach me with "Linux must be broken and unportable, let's come up
with a new way without even checking out how the Linux implementation
really works".  This is insulting to me and others who have looked
deeply into these problems and have considered all existing ways
including Linux's, such as Larry McVoy who has been doing this for
years.

David S. Miller
davem@caip.rutgers.edu