Subject: Re: But why?
To: None <dyson@freebsd.org>
From: David S. Miller <davem@caip.rutgers.edu>
List: tech-kern
Date: 10/24/1996 02:38:59
   From: "John S. Dyson" <toor@dyson.iquest.net>
   Date: Thu, 24 Oct 1996 01:21:23 -0500 (EST)

   However, I would definitely say that even though I don't ignore the
   additional overhead of the system calls, it definitely isn't where
   there is a lot of time to be mined.  Scalability of algorithms
   and cache footprints are.

Overhead of system calls is a product of many things; _trap_
entry/exit footprint is one of them.  Here is an example:

Just about every Sparc unix saves both the full 8 in registers and the
8 global registers in an exception frame at trap time.  With the highly
typical (for Sparcs) 32 byte cache line size, that's two whole cache
lines of space.
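
To make that arithmetic concrete, here is a minimal C sketch.  It
assumes 32-bit registers, the 32 byte line size mentioned above, and a
save area that starts on a cache line boundary; the names are made up
for illustration and come from no real kernel.

```c
#include <stddef.h>

/* Figures taken from the text: 32-bit registers, 32 byte cache lines. */
enum { REG_BYTES = 4, CACHE_LINE_BYTES = 32 };

/*
 * Number of cache lines covered by nregs registers stored back to back,
 * assuming the save area starts on a cache line boundary.
 */
size_t lines_for_regs(size_t nregs)
{
    size_t bytes = nregs * REG_BYTES;

    /* Round up to whole cache lines. */
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}
```

With these assumptions, lines_for_regs(16) covers the 8 ins plus the 8
globals and comes to two lines, while lines_for_regs(8) for the globals
alone comes to one.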

It occurred to me the other day that saving the in registers is such
a waste of time: they are saved via the window overflow/underflow trap
handlers, and there are only very isolated cases when code "higher up"
has to change those values.  The only frequent case is system calls
(this is turning into two optimizations here, one for system calls and
one for trap cache patterns, for _all_ traps).

For the system call case, because of my architecture, the code which
needs to modify them has direct access to them (the system call occurs
from the first level kernel trap register window), blamo and in they
go when the system call returns.

The other cases are the nutcases and not the common case at all:

	a) System call restarts during signal delivery

	b) ptrace()

For system call restarts, things work out perfectly, because you need
to flush the register windows all to the stack anyway to set up the
user frame.  A pointer to the kernel window in question (the one with
the in registers) is readily available via a per-task struct ptr.
This case is solved.

The second case is even easier.  By the time a parent pokes at a
child's registers or modifies his state, the child has stopped and all
of his state has been saved (i.e. the register windows have all been
flushed as a consequence of the child's task switch).  Thus the same
easy access to the ins is available.
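
Both rare cases come down to the same thing: reaching the ins through
a per-task pointer to a window that has already been flushed to
memory.  The following is only a sketch under assumptions.  The struct
and function names are hypothetical, and the window layout (8 locals
followed by 8 ins) is the usual Sparc window save area layout, not
something taken from my kernel source.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of one flushed register window: 8 locals, then 8 ins. */
struct reg_window {
    uint32_t locals[8];
    uint32_t ins[8];
};

/* Hypothetical per-task structure holding the pointer described above. */
struct task {
    struct reg_window *trap_window; /* flushed window with the trap-time ins */
};

/*
 * Overwrite in register n of a stopped task, as a syscall restart or
 * ptrace() might need to.  Returns 0 on success, -1 on bad arguments.
 */
int task_set_in(struct task *t, unsigned n, uint32_t value)
{
    if (t == NULL || t->trap_window == NULL || n >= 8)
        return -1;
    t->trap_window->ins[n] = value;
    return 0;
}
```

The point of the sketch is that nothing here needs a copy of the ins in
the trap frame; the flushed window is the only copy.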

Ok, result is that I can save 8 fewer 32-bit registers in my pcb on
every single trap the system takes on any processor.  Net result: one
less cache line per trap that the kernel blows away, 4 fewer double
stores on the way into the kernel, and 4 fewer loads on the way out of
the kernel.  It might even be possible to not save the PCs and the
PSR in the frame either, but this would be a bit trickier to pull off.
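
As a rough illustration of the savings, compare two hypothetical trap
frame layouts in C.  The field names, and the exact set of control
registers kept alongside the globals, are assumptions for the sake of
the example rather than any real kernel's frame.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical frame that saves everything at trap time. */
struct full_frame {
    uint32_t psr, pc, npc, y;   /* control state; this set is an assumption */
    uint32_t globals[8];
    uint32_t ins[8];            /* the part proposed for removal */
};

/* Hypothetical frame with the ins left to the register window handlers. */
struct slim_frame {
    uint32_t psr, pc, npc, y;
    uint32_t globals[8];
};

/* Bytes no longer written on trap entry or read back on trap exit. */
size_t frame_savings_bytes(void)
{
    return sizeof(struct full_frame) - sizeof(struct slim_frame);
}
```

The difference is 8 registers of 4 bytes each, i.e. 32 bytes: one cache
line at the line size above, or four 8-byte double stores.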

I haven't implemented this yet (I came up with it 4 days ago), but I do
know that no other Sparc based Unix takes advantage of this (at least
to my knowledge; if someone knows otherwise please fill me in, I'd love
to know of prior art).

David S. Miller
davem@caip.rutgers.edu