Subject: Re: But why?
To: Tim Newsham <newsham@aloha.net>
From: Linus Torvalds <torvalds@cs.Helsinki.FI>
List: tech-kern
Date: 10/28/1996 19:41:57
[ duh, I've been away for a week, and find a lot of messages on this 
  subject in my mail-box. I have to reply, simply because I think it's a very
important issue, and some people have been raising such bogus arguments that it's
  truly horrifying ]

Some idiot (I forget who, you know who you are, consider yourself insulted)
said that his machine was idle waiting for disk about 95% of the time, and
that under those conditions it wouldn't make sense to optimize the interrupt
path because it obviously would be a micro-optimization that could at best
give him a 5% performance increase even if one succeeded in optimizing it
_all_ away (which is obviously not possible). 

How d*mn stupid can you get? 

If you count that way, why the hell do you have a machine that is 20 times
too fast for your needs? You'd be better off getting an old i386 and saving a
lot of money instead of using whatever machine you use now that is obviously
too fast for you. 

In short, arguing that something is idle most of the time is a truly _stupid_
argument against optimizations. Idle time does not count, period. Never has,
never will. The _only_ thing that counts is how fast you can do something. 

On Wed, 23 Oct 1996, Tim Newsham wrote:
> 
> ok.  say I'm reading 256k at a time and perhaps
> getting 3.5Mbytes/sec out of my filesystem.

Umm? WHO CARES?

As you correctly point out, saving 20 machine cycles for this case is
negligible, because you'll be spending all your time copying or waiting for
the disk. 

HOWEVER, that isn't the issue, and it never has been. IF you're doing 256kB
reads, you may be correct (and as David pointed out, even then you may not be
correct, because interrupt latency does matter, as does interrupt TLB and
cache footprint). 

However, there are lots of things that don't do 256kB reads. In fact, I'd be
surprised if 256kB would be anywhere _close_ to a normal read. I'd say most
IO requests are in the <16kB range, and many of them are much smaller. 

Think X, for example. What do you think is the normal IO size? Is it 256kB?
Or is it even 256B? Nope. It's 256 *bits* - 32 bytes. 

Now you say "X is a special case, 32 bytes happens to be the size of an X
request, most other things don't do that kind of thing". 

You're wrong. X is _not_ a special case, and even if it were I consider
graphics performance to be important enough that I'd like to make those 32
bytes go fast. 

And you know what? The overhead of copying 32 bytes is negligible compared
to the cost of kernel entry and getting to the networking layer (and
often doing a task switch too, in the case of X - many of the requests will
wait synchronously for an answer from the X server). 
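A minimal sketch of that comparison (illustrative only: the numbers vary
wildly by machine, and you want to compile without optimization so the copy
loop isn't thrown away entirely):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define N   100000
#define REQ 32                          /* "256 bits" - one X-sized request */

static double seconds(void)
{
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
        char src[REQ] = "tiny request", dst[REQ];
        int sv[2], i;
        double t0;

        /* pure user-space copy of one request */
        t0 = seconds();
        for (i = 0; i < N; i++)
                memcpy(dst, src, REQ);
        printf("memcpy 32 bytes:     %8.1f ns\n", (seconds() - t0) / N * 1e9);

        /* the same 32 bytes, but through the kernel and back again */
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
                return 1;
        t0 = seconds();
        for (i = 0; i < N; i++) {
                write(sv[0], src, REQ);
                read(sv[1], dst, REQ);
        }
        printf("write+read 32 bytes: %8.1f ns\n", (seconds() - t0) / N * 1e9);
        return 0;
}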

Micro-optimizations do make sense. Some other idiot claimed that it doesn't
make any difference how fast a loop of "gettimeofday()" goes, because that's
only something a benchmark would ever be interested in measuring. 

Guess what? gettimeofday() and the system call entry and exit code actually
show up pretty clearly on kernel profiles under some circumstances, even if
there is no benchmarking going on. Why? X11 is one reason, again. 
Timestamping every packet that comes in. 
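You can see the cost yourself with something as dumb as this (a sketch; on a
system where gettimeofday() really is a system call, most of what you measure
here is the entry/exit path, not the time-keeping):

#include <stdio.h>
#include <sys/time.h>

int main(void)
{
        struct timeval start, end, tv;
        long i, n = 1000000;
        double ns;

        gettimeofday(&start, NULL);
        for (i = 0; i < n; i++)
                gettimeofday(&tv, NULL);        /* one kernel entry+exit each */
        gettimeofday(&end, NULL);

        ns = ((end.tv_sec - start.tv_sec) * 1e6 +
              (end.tv_usec - start.tv_usec)) * 1000.0 / n;
        printf("%.0f ns per gettimeofday()\n", ns);
        return 0;
}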

"select()" is even more critical, that shows up whatever you do under X11. 

Now, I want you to understand: throughput has no meaning in these things. 
Reading a 256kB area is a non-issue. The whole thing is not "data-centric",
it's "event-centric". And there latency matters. It matters a lot. And the
latency is _exactly_ in things like getting in and out of kernel mode, and
doing some "simple" system calls or process re-scheduling. Claiming that
system call entry doesn't matter is just plain wrong. 

Now, there are people out there that hate X, and say that X can never be
fast, so why do I even mention "X" and "performance" in the same sentence? 

I'll tell you why: Because I think people who dismiss latency issues are
STUPID people. Lots of UNIX people optimize throughput, and if you do that
and don't feel that latency is important enough, then yes, X will be slow for
you. But I personally feel that latency is very important, and that there
isn't really any reason why X should be all that slow (*). 

(*) X _is_ slow, if you compare against painting directly to the frame
buffer, but X is _fast_ if you consider the fact that it's network
transparent etc. At least on the right system. 

Now, X is not the only example of something where latency matters, but 
it's one that most people know about. 

In fact, I'd say that _throughput_ is the "uninteresting" case from an
optimization viewpoint, because throughput is mostly: 

 - avoid copying
 - avoid looking at the same memory twice (e.g. do checksum and copy in 
   one go; see the sketch below). In fact, avoid looking at it even once 
   if you can.
 - do read-ahead and write-behind if possible, and strive to do it on 
   large contiguous blocks.

Those three points just about cover it when it comes to throughput.  Just
about everybody agrees on the above, and UNIX people will look wise and nod
their heads at the above three rules. 

Also note how all three rules above are "generic" rules: you can essentially
design a system around them, and you'll be fine both on paper and in
practice. The rules do not contradict each other, so there is little point in
arguing about one rule vs another. 

In contrast, latency is _much_ more difficult, yet it is as important as
throughput. For latency: 

 - avoid extra instructions (= micro-optimizations, like doing inline
   functions to get better register usage and less call overhead)
 - avoid constant loop overhead (= _simple_ algorithms)
 - avoid extra loops (= _better_ algorithms)
 - avoid cache and TLB mis-use

Note that the above rules are no longer independent: a complex algorithm with
good O(n) behaviour may imply bad cache or TLB behaviour and/or more
overhead. There can also be other issues like how well the compiler can keep
things in registers etc. (complex algorithms generally require more state). 
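
To make the first latency rule concrete, here is the kind of thing it means
(an illustration, not a real kernel helper): a tiny check that is cheap
inline and comparatively expensive as a real call.

#include <stdio.h>

/* Out of line: every use pays the calling convention plus call/return,
 * and the arguments get shuffled out of whatever registers they were in. */
int range_ok_call(unsigned long addr, unsigned long limit)
{
        return addr < limit;
}

/* Inlined: usually collapses to the single compare the caller needed
 * anyway, with everything staying in registers. */
static inline int range_ok_inline(unsigned long addr, unsigned long limit)
{
        return addr < limit;
}

int main(void)
{
        unsigned long i, hits = 0, limit = 0xc0000000UL;

        /* pretend this is a hot path, like validating user pointers */
        for (i = 0; i < 1000000; i++)
                hits += range_ok_inline(i * 4096, limit);
        printf("%lu of 1000000 in range\n", hits);
        return 0;
}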

This is where a lot of UNIX bigots seem to trip up. It's _unbelievable_ how
many "sane" people will argue against the above four points because they
think those kinds of optimizations are a waste of time.  They think the three
rules of bandwidth make up for things. Damn idiots,

		Linus