Subject: Re: But why?
To: None <perry@piermont.com>
From: David S. Miller <davem@caip.rutgers.edu>
List: tech-kern
Date: 10/23/1996 19:59:17
   Date: Wed, 23 Oct 1996 19:37:37 -0400
   From: "Perry E. Metzger" <perry@piermont.com>

   > You're missing my point just to be rude and obnoxious.

   No, I got your point. The point was that your point is
   meaningless. Saving 20 instructions here and 10 instructions there
   is useless except in the inner loop of DES or an MPEG decoder.

I disagree for common code paths (i.e. trap entry and exit; think
interrupt latency).

   > (I'm quoting Jacobson here, when he heard someone at a conference
   > say that TCP could never be made to go fast) "Bullshit!"

   I didn't say things couldn't be optimized. I said the kernel isn't
   a bottleneck.

   On networking, well, my kernel already totally saturates my
   ethernet without taking significant CPU time. This means there is
   no point in optimizing it further. When I go for 100Mbps ethernet,
   then it will be time.

I can saturate a Cisco 100Mb/s ether switch on an SS10 with Linux
these days.  I'm up to 100Mb/s, FDDI, and soon ATM on my SparcLinux
machines, so this is a significant concern for me.

   Sure, if you find that something is a real problem (like you have a
   100Mbps ethernet and you are getting 11Mbps out of it) you should
   optimize. However, there is no point if you don't see a problem.

You're getting the problem even over 10Mb/s ether; you just haven't
measured it.

   > "We have faster boxes, we can afford to let the kernel go a bit
   > slow because the hardware is faster now."

   Are you being deliberately thick?

No, you are.  Let me show you.  Let us follow your logic here.

   Most of the machines I use spend only a few percent of their lives
   in the kernel.

"These jobs are not in the kernel much."

   Wall Street applications, which you explicitly named, are very much
   like this. Compiles, which my machines do a lot of, are bound on
   CPU in userspace and I/O during lexing. The kernel time?
   Negligible. Maybe a few percent.

"These jobs are I/O bound during lexing."

   On things like tickerplants, once your VM is decent, what you need
   is lots of memory, because memory ends up being a giant disk cache,
   and lots of fast disk, because memory can't store
   everything. Usually the CPU doesn't tick over on these machines --
   they are almost always I/O bound.

You need fast disk, and fast disk cache from memory.  Well, I don't
care what kind of mmu architecture your machines have, you are faulting
a large percentage of the time.  And I don't care how fast your VM can
service the page fault, you are taking trap overhead every single time
a page translation is not in the mmu or the cpu cache is not hot or
has been singed by bloatOS footprint.

And if you need to start going out to disk, guess what: you had better
have a fast interrupt path to move the data quickly.  Every extra cycle
your system takes to start and eventually finish this swap or disk
access means cycles your large data set process sits in wait.

   The software crisis is that software can't be written fast enough
   to satisfy the needs we have. The software is usually fast enough.

There's one for my quote file.  "The problem is that software can't be
written fast enough. The software is usually fast enough."  If it were
usually fast enough we would not be in a software crisis!

   Some applications, like desktop video, are still cutting edge CPU
   wise, but these rarely even touch the kernel.

Ask the people who really work on this: they use interrupts serviced in
user mode, or kernel-resident tasks, to get acceptable performance for
these activities.  You had better be task switching fast as well.

   The only place the kernel is still a bottleneck is in IPC,
   especially in network stuff. Removing excess kernel copies from the
   IP stack is important if we want to use things like ATOMIC as our
   LANs of the future, running at 600Mbps or 1.2Gbps. However, for most
   stuff, the kernel isn't noticed.

Duh, the whole origin of my argument and this entire discussion was
Linux's networking bandwidth and latency capabilities.  The choir
listens to you preach.

   Take disk I/O for example. Most decent operating systems already
   get virtually every last drop of I/O the disk is capable of. Sure,
   maybe you can chop a few instructions here or a few instructions
   there -- but why bother? No point. If you hit the wall, you need
   faster disk, or better cache or layout algorithms, or striping --
   shaving an instruction or two on the kernel won't help you.

Don't drop latency from this example.  Once you do get the data off
the disk, however fast, how quickly can you get the context and the
data itself back to the task that asked for it in the first place?
Again, this is your memcpy() and csum_partial_copy() code in assembly,
your context switch overhead, and your trap/system-call entry/exit costs.
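
To make the copy-plus-checksum point concrete, here is a rough sketch of
the idea behind a combined routine: copy the data and accumulate the
16-bit one's-complement Internet checksum in the same pass, so each byte
is touched once instead of twice.  This is portable C for illustration
only, not the assembly csum_partial_copy() of any kernel; the name
copy_and_checksum and its signature are assumptions made up for this
example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy len bytes from src to dst while summing the
 * same bytes as big-endian 16-bit words (Internet checksum style).
 * Returns the folded 16-bit sum, NOT complemented. */
uint16_t copy_and_checksum(void *dst, const void *src, size_t len)
{
    const uint8_t *s = src;
    uint8_t *d = dst;
    uint64_t sum = 0;   /* wide accumulator so carries are not lost */

    while (len >= 2) {
        d[0] = s[0];
        d[1] = s[1];
        sum += ((uint64_t)s[0] << 8) | s[1];
        s += 2;
        d += 2;
        len -= 2;
    }
    if (len == 1) {     /* odd trailing byte is padded with zero */
        d[0] = s[0];
        sum += (uint64_t)s[0] << 8;
    }
    while (sum >> 16)   /* fold carries back into the low 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}
```

A hand-tuned version would unroll the loop and work a word at a time,
but the payoff is the same in kind: one pass over the data and one set
of cache misses instead of two.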

I sit behind the wheel every day watching where tweaks matter, where
the big payoffs are.  And my message to everyone is "hey, you're
forgetting locore, dudes, look at how it pays off for me."  Your system
takes at least 100 clock interrupts every second, every process makes
some number of system calls during its lifetime, and page faults happen
even for processes that make only one or two system calls.  Your path in
and your path out is therefore critical to _all_ of your performance.
Kernel cache footprint, for all tasks, is difficult to quantify yet
known to be a significant overhead.  Believe me, I've talked to people
who hook up the scope to the CPU in the labs and watch every cycle,
trap, and cache miss.  This is a real concern.
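
The arithmetic behind that claim can be sketched in a few lines of C.
Every kernel entry (clock interrupt, system call, page fault) pays the
same entry/exit path, so cycles shaved there are multiplied by the total
trap rate.  Only the 100 clock interrupts per second comes from the
paragraph above; the function name and any other rates you plug in are
hypothetical, not measurements.

```c
#include <stdint.h>

/* Hypothetical estimate: cycles per second spent purely on getting into
 * and back out of the kernel, given per-second event rates and the cost
 * of one entry/exit round trip in cycles. */
uint64_t entry_exit_cycles_per_sec(uint64_t clock_irqs_per_sec,
                                   uint64_t syscalls_per_sec,
                                   uint64_t faults_per_sec,
                                   uint64_t cycles_per_entry_exit)
{
    uint64_t traps = clock_irqs_per_sec + syscalls_per_sec + faults_per_sec;
    return traps * cycles_per_entry_exit;
}
```

For instance, with 100 clock interrupts plus an assumed 2000 system
calls and 400 page faults per second, trimming 30 cycles from the
entry/exit path is worth 2500 * 30 = 75000 cycles every second, before
counting any cache footprint effects.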

David S. Miller
davem@caip.rutgers.edu