Subject: Re: But why?
To: Chris Torek <torek@BSDI.COM>
From: David S. Miller <davem@caip.rutgers.edu>
List: tech-kern
Date: 10/23/1996 22:04:49
   Date: Wed, 23 Oct 1996 19:10:22 -0600 (MDT)
   From: Chris Torek <torek@BSDI.COM>

   Benchmarks are useful because they give you a consistent measure.
   Benchmarks are harmful, however, when the measure they give you is
   not a measure of `real' performance on `real' applications.
   Unfortunately, `real' applications (a) vary from one person to the
   next and (b) rarely work well as benchmarks.

I think lmbench certainly measures 'real' performance, at least for
the application types it was geared for.  (I guess Larry can tell us
what these "application types" were when he initially wrote it all.)

   One problem with optimizing system calls in general is that only
   benchmarks spend a large fraction of time making repeated getpid()
   calls, and speeding up such a benchmark is not useful.

lmbench uses read() on /dev/null for this, in case you didn't know.
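
Roughly, that measurement looks like this (my own sketch of the idea,
not lmbench's actual code; the iteration count and timing details are
arbitrary):

/*
 * Time a tight loop of 1-byte read()s from /dev/null.  The read
 * returns 0 immediately, so the loop measures raw syscall overhead.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define ITERS 1000000

int
main(void)
{
	struct timeval start, stop;
	double usecs;
	char c;
	int fd, i;

	fd = open("/dev/null", O_RDONLY);
	if (fd < 0) {
		perror("/dev/null");
		exit(1);
	}

	gettimeofday(&start, NULL);
	for (i = 0; i < ITERS; i++)
		read(fd, &c, 1);	/* returns 0 at once: pure syscall cost */
	gettimeofday(&stop, NULL);

	usecs = (stop.tv_sec - start.tv_sec) * 1e6
	    + (stop.tv_usec - start.tv_usec);
	printf("null syscall: %.3f usec\n", usecs / ITERS);
	return 0;
}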

   On the other hand, applications that are important to someone *do*
   spend a lot of time making, say, read() or write() calls -- and
   making getpid() faster also makes those faster.  The question (for
   which I do not have the answer) is, how *much* faster, and should
   the effort be put into the syscall stub, or into the path within
   the file system read() call?  The time for a read() may turn out to
   be dominated by byte copies that could be eliminated entirely via
   page-mapping (e.g., replace the user's buffer pages with COW pages
   that alias the buffer cache).
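
The user-visible half of that trick already exists, of course: mmap()
lets an application walk the buffer cache's pages in place instead of
paying for a copy on every read().  A rough sketch (the file name is
made up and error handling is minimal):

/*
 * Checksum a file by mapping it rather than read()ing it into a
 * private buffer, so no bytes are copied out of the buffer cache.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int
main(void)
{
	unsigned long sum = 0;
	struct stat st;
	char *p;
	off_t i;
	int fd;

	fd = open("/tmp/somefile", O_RDONLY);	/* hypothetical input */
	if (fd < 0 || fstat(fd, &st) < 0) {
		perror("open/fstat");
		exit(1);
	}

	/*
	 * No copy here: the mapping aliases the buffer cache, and
	 * pages fault in as the loop touches them.
	 */
	p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}

	for (i = 0; i < st.st_size; i++)
		sum += (unsigned char)p[i];	/* scan with no copy */

	printf("checksum %lu\n", sum);
	munmap(p, st.st_size);
	return 0;
}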

I'd say many applications sit around doing:

	a) reading and writing small "protocol control" information
	   over TCP between client and server

	b) doing bulk transfers over TCP

	c) mmap()'ing a file and scanning over large tracts of it

	d) read()'ing from a heavily accessed file, which most likely
	   is sitting in the buffer cache already

	e) fork()'ing and exec()'ing new tasks

	f) context switch()'ing from client to server

	g) transferring data via a pipe (see 'f'; a sketch of this
	   follows below)

I could go on and on, and lmbench measures everything I have mentioned
thus far.  As do some other benchmarks, to different degrees of
accuracy.
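
For instance, (f) and (g) boil down to the classic pipe ping-pong:
bounce one byte between two processes so that every round trip costs
two context switches plus four small syscalls.  Again a sketch of the
idea, not lmbench's actual code, with an arbitrary iteration count:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

#define ROUNDS 10000

int
main(void)
{
	struct timeval start, stop;
	int up[2], down[2];
	double usecs;
	char c = 'x';
	int i;

	if (pipe(up) < 0 || pipe(down) < 0) {
		perror("pipe");
		exit(1);
	}

	switch (fork()) {
	case -1:
		perror("fork");
		exit(1);
	case 0:				/* child: echo every byte back */
		close(up[1]);
		close(down[0]);
		while (read(up[0], &c, 1) == 1)
			write(down[1], &c, 1);
		_exit(0);		/* parent closed its end */
	}

	close(up[0]);			/* parent: timed ping-pong */
	close(down[1]);
	gettimeofday(&start, NULL);
	for (i = 0; i < ROUNDS; i++) {
		write(up[1], &c, 1);
		read(down[0], &c, 1);
	}
	gettimeofday(&stop, NULL);

	usecs = (stop.tv_sec - start.tv_sec) * 1e6
	    + (stop.tv_usec - start.tv_usec);
	printf("pipe round trip: %.1f usec\n", usecs / ROUNDS);

	close(up[1]);			/* child sees EOF and exits */
	wait(NULL);
	return 0;
}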

   >Alan Cox just devised a way for Linux/SPARC to avoid packet
   >copying on our networking stack ...

   This is not a micro-optimization.  (Neither, for that matter, is
   the `system calls via normal subroutine calls' trick, although this
   is probably not the place to *start* optimizing.)

If you lack a firm foundation (i.e., having thought about the trick
early on, when you first put the pieces of the system together), it
is much more difficult to go back and "do it later".  At least that
has been my painful experience most of the time when I messed up an
interface and had to completely redo it later to get it "right".

   In particular, for applications that spend all their time sending
   bulk network data, eliminating these copies eliminates the place
   they spend most of their time -- a network send is, or should be,
   dominated by the time spent copying those bytes.

I agree.

"If the performance ain't crankin', you're just yankin'."
	- Steve Alexander

David S. Miller
davem@caip.rutgers.edu