Subject: Re: But why?
To: Tim Newsham <newsham@aloha.net>
From: Linus Torvalds <torvalds@cs.Helsinki.FI>
List: tech-kern
Date: 10/29/1996 01:47:25
[ I already got mail saying that I should have let the thread die. I 
  probably should have. I will, soon, as I'll be going to bed anyway.. ]

On Mon, 28 Oct 1996, Tim Newsham wrote:
> 
> > In contrast, latency is _much_ more difficult, yet it is as important as
> > throughput. For latency: 
> 
> This is exactly why the benefits of latency optimizations
> should be examined first.  Programmer time is a scarce
> resource.  Interesting projects abound.  Are the benefits
> of particular latency optimizations worth more than
> the other projects that could be done?  

Ok, this is certainly a valid concern: the amount of time it takes to do
latency optimizations. I can only agree - it's a bitch to do, and not only is
it hard to optimize, it's also hard to be 100% sure of the results. In many
cases a minor change can have unexpected and subtle side effects like moving
some code across a cache line so that two functions suddenly start thrashing
in a direct-mapped L1 cache, for example. 
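
As a rough sketch of the cache effect mentioned above: in a direct-mapped cache, each address can live in exactly one slot, so two pieces of code whose addresses differ by a multiple of the cache size evict each other. The cache geometry (8 KB, 32-byte lines) and both addresses below are hypothetical values chosen for illustration, not measurements from any particular CPU or kernel.

```python
def cache_line_index(addr, cache_size=8192, line_size=32):
    """Slot a byte address maps to in a direct-mapped cache.

    cache_size and line_size are assumed example values.
    """
    return (addr // line_size) % (cache_size // line_size)


def thrash_candidates(addr_a, addr_b, cache_size=8192, line_size=32):
    """True if two addresses fall in different memory lines but the same slot."""
    same_slot = (cache_line_index(addr_a, cache_size, line_size)
                 == cache_line_index(addr_b, cache_size, line_size))
    different_line = addr_a // line_size != addr_b // line_size
    return same_slot and different_line


# Hypothetical function entry points, exactly one cache size (0x2000) apart.
func_a = 0xC0110E20
func_b = 0xC0112E20

print(thrash_candidates(func_a, func_b))       # the two share a slot
print(thrash_candidates(func_a, func_b + 32))  # moved by one line, no longer share it
```

Shifting one function by a single cache line is enough to create or remove the conflict, which is why an unrelated code change can alter the timings.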

On the other hand a lot of these micro-optimizations are "interesting hacks",
so even though they may not make much sense from a marketing viewpoint, they
_may_ make sense from a purely personal viewpoint for some people. I know
that both I and David happen to be of that twisted mentality. 

And I think it actually pays off in the long run. The odd tweak here or there
pays off: if you have a mentality like "every bit matters", the system _will_
perform better even though you may not be able to pinpoint exactly which part
of it is so outstanding.. 

> Linux clearly has people who are interested in doing
> the latency optimizations.  If they want to convince others
> that they should spend programmer time doing the
> same they should quantify the benefits.

I think we can quantify them, but whether it is "enough" to convince people,
that's a separate issue. 

> > This is where a lot of UNIX bigots seem to trip up. It's _unbelievable_ how
> > many "sane" people will argue against the above four points because they
> > think those kinds of optimizations are a waste of time.  They think the three
> > rules of bandwidth make up for things. Damn idiots,
> 
> Are you saying it is not a waste of time?  How did you
> come to this conclusion?  Could you elaborate on the
> benefits?

I personally don't just randomly pick out some specific function and decide
to optimize it. That does happen "by mistake" in some cases: if I for some
reason am looking at a function I may end up rewriting it to be better even
though I might be looking at it for some other reason. But in _most_ cases I
do a simple profile of the kernel, and look at what shows up. 

For "throughput" things, the peaks are very clear: clearing pages (when a 
process needs a new page) and copying memory. Those essentially dwarf
everything else.

However, you can actually see the different behaviour of non-throughput
processes very clearly on the kernel profiles. I don't know if others have
quite as simple a profiling system, but with Linux it's really trivial to do
even instruction-level profiles of the kernel, and to clear the profiling
data for another run after having done something else. 

For example, right now my profile looks something like this:

       ...
       282     1.35% c0111348 interruptible_sleep_on
       317     1.52% c015a184 ext2_match
       393     1.89% c011d72c generic_file_read
       503     2.42% c011de24 filemap_nopage
       548     2.64% c012536c get_hash_table
       581     2.80% c011b470 do_no_page
       592     2.85% c0156fb4 ext2_check_dir_entry
       757     3.64% c011b0a8 do_wp_page
       776     3.74% c012375c sys_read
       911     4.39% c010aac0 ret_from_sys_call
       977     4.70% c01111bc wake_up_interruptible
       997     4.80% c012381c sys_write
      1744     8.40% c012c1fc pipe_write
      1874     9.03% c012c01c pipe_read
      1903     9.17% c010aa70 system_call
      2068     9.96% c0110e2c schedule
     20748   100.00% 00000000 total

which essentially just gives you the ticks in a function, followed by a
percentage of total ticks in the kernel, and then the kernel address and a
symbolic name for that address (this is after an uptime of less than two
hours - I have rebooted for a new kernel not too long ago). 
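
A minimal sketch of reading a listing in the format described above, assuming four whitespace-separated columns (ticks, percentage, hex address, symbol) and a final "total" row. The function names are made up for illustration and are not part of any Linux profiling tool. The percentages in the listing are consistent with truncation rather than rounding (2068 of 20748 ticks is about 9.967%, shown as 9.96%), so the sketch truncates too.

```python
# A few rows copied from the listing above.
PROFILE = """\
      1744     8.40% c012c1fc pipe_write
      1874     9.03% c012c01c pipe_read
      1903     9.17% c010aa70 system_call
      2068     9.96% c0110e2c schedule
     20748   100.00% 00000000 total
"""


def parse_profile(text):
    """Return (ticks, address, symbol) tuples, skipping lines that do not fit."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue
        ticks, _percent, addr, name = fields
        rows.append((int(ticks), int(addr, 16), name))
    return rows


def percent(ticks, total):
    """Share of total ticks, truncated to two decimals with integer arithmetic."""
    hundredths = ticks * 10000 // total
    return f"{hundredths // 100}.{hundredths % 100:02d}%"


rows = parse_profile(PROFILE)
total = next(t for t, _addr, name in rows if name == "total")
# Print the functions from hottest to coldest.
for ticks, addr, name in sorted(rows, reverse=True):
    if name != "total":
        print(f"{ticks:6d} {percent(ticks, total):>8} {addr:08x} {name}")
```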

Now, I've run lmbench on this machine, and it shows. A "real" profile never
looks like this, but even a real profile will actually show you where in the
kernel the time is spent. And as I said, I have the tools to do the profile
on an instruction level, and I have used that to good advantage: it's a
_great_ way of finding out where the problems are in reasonably complex
systems like TCP. 

I don't find many obvious hot-spots for "normal" code any more these days,
because I've optimized the things that do show up on profiles.  "schedule",
"system_call" and "ret_from_sys_call" _still_ show up on profiles, and not
just for lmbench runs. They've been optimized almost silly.. The "select" 
group of functions is next on my list, because they show up _very_ clearly
running X11..

Trust me - these kinds of optimizations do pay off. A year ago, BSD people
were laughing at Linux TCP performance, and what I did was essentially to
profile the kernel after having run different types of TCP programs (lmbench
for latency and throughput, "ftp" and "tcpspray" for different throughput
patterns etc). Let's just say that people don't laugh any more.. 

> I know computer science is more black magic than science
> but there is much to be said for knowing the system
> and its weaknesses (where time is being spent, what demands
> are usually placed on it), figuring out what to attack
> based on this information and understanding the results
> (did you get the results you expected?  if not why?).

It's definitely not black magic.

It's a lot of profiling code, and knowing the hardware and the compiler.  It
also needs reasonable "loads", in order to get good profiles. That's why I
like lmbench: I can get good profiles, and I tend to agree with Larry on the
things he tests, so I tend to think that the profile information actually
makes a difference. 

> I would love to see a justification for the changes
> made along with an analysis detailing the benefit
> of the new system over the old system.  Unfortunately
> the closest I've seen so far was:
> 
>   "Linux with these changes ran faster than solaris
>    without these changes"

Hmm.. I can't give you any _really_ specific examples of any specific
optimization, but let me just say that getting good performance in TCP
involved a lot of "macro-optimizations" (avoiding the copies, doing
checksum+copy), but there were _lots_ of micro-optimizations too. For
example, using 4MB pages on the Pentium makes a difference. Not a large
one, but it's definitely noticeable. And essentially the same optimization
was the last straw needed to beat Solaris on the sparc, too.
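
To illustrate the "checksum+copy" idea in general terms: instead of walking the data once to copy it and a second time to checksum it, both are done in a single pass. The sketch below uses the standard 16-bit ones' complement Internet checksum; it is not the Linux kernel's code (which does this in hand-tuned assembly), and the function name is hypothetical.

```python
def copy_and_checksum(src, dst):
    """Copy src into dst and return the 16-bit Internet checksum of src.

    dst must be a writable buffer (e.g. bytearray) at least len(src) long.
    Each byte of src is read only once, serving both the copy and the sum.
    """
    total = 0
    n = len(src)
    for i in range(0, n - 1, 2):
        hi, lo = src[i], src[i + 1]
        dst[i] = hi
        dst[i + 1] = lo
        total += (hi << 8) | lo      # big-endian 16-bit word
    if n % 2:                        # odd trailing byte, padded with zero
        dst[n - 1] = src[n - 1]
        total += src[n - 1] << 8
    while total >> 16:               # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


payload = bytes.fromhex("0001f203f4f5f6f7")
buf = bytearray(len(payload))
checksum = copy_and_checksum(payload, buf)
print(hex(checksum), bytes(buf) == payload)
```

The win on real hardware comes from touching memory once instead of twice; the Python version only shows the structure of the loop, not the speed.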

And TCP latency shows up things you wouldn't really expect: the copy time
doesn't matter, so other things start making themselves felt. Like how
quickly the stack can handle ACKs, and even how the read queue is emptied of
packets during the read and how you handle timeouts for delayed ACKs. It all
does show up on the profiles, but obviously there are dependencies between
functions that make it impossible to do "local" micro-optimizations, so you
have to do some "global" analysis too to find out what things you want
out-of-line etc.. 

		Linus