Subject: Re: But why?
To: None <perry@piermont.com>
From: David S. Miller <davem@caip.rutgers.edu>
List: tech-kern
Date: 10/23/1996 22:33:23
   Date: Wed, 23 Oct 1996 21:24:52 -0400
   From: "Perry E. Metzger" <perry@piermont.com>

   > And I don't care how fast your VM can service the page fault, you
   > are taking trap overhead every single time a page translation is
   > not in the mmu or the cpu cache is not hot or has been singed by
   > bloatOS footprint.

   The time it takes to move the block off the disk dominates. The
   microscopic amount of time the machine spends in kernelland
   doesn't.

Even for your purely compute/IO bound jobs you are leaving out two
important issues (at least to me):

	a) Consider that your OS is taking, say, 200 cycles more than it
	   needs to in raw trap entry/exit overhead alone.  Say also
	   that once your application takes control again, it only
	   takes 100 instructions to get to the point where it asks
	   the kernel to queue up another disk request.  The kernel is
	   now wasting twice the cycles of the useful work your
	   application does between requests; I'd say you are losing
	   here.

	b) If your OS is bloated in its interrupt path, every single
	   clock tick your application is losing real cache lines and
	   real TLB entries.  This is a fact and I have seen it in
	   action.  A userland memcpy() was faster under SparcLinux,
	   by a significant and noticeable amount, than under both
	   Solaris and SunOS, because the Linux kernel had a much
	   better cache/TLB footprint and access pattern.  The test
	   never explicitly uses kernel services, just like the
	   applications you describe (a sketch of the sort of test I
	   ran follows this list).
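
For the curious, here is a minimal sketch of the kind of userland
memcpy() timing I mean.  The buffer size and iteration count below are
made up for illustration; the point is only that the loop never asks
the kernel for anything, yet the kernel's interrupt-path footprint
still shows up in the number it prints.

/*
 * Minimal userland memcpy() timing sketch.  BUFSIZE and ITERS are
 * illustrative values, not the ones from my actual runs.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define BUFSIZE (1024 * 1024)	/* 1MB working set */
#define ITERS   512

int main(void)
{
	char *src = malloc(BUFSIZE);
	char *dst = malloc(BUFSIZE);
	struct timeval start, end;
	double secs;
	int i;

	if (src == NULL || dst == NULL)
		return 1;
	memset(src, 0xa5, BUFSIZE);

	gettimeofday(&start, NULL);
	for (i = 0; i < ITERS; i++)
		memcpy(dst, src, BUFSIZE);
	gettimeofday(&end, NULL);

	secs = (end.tv_sec - start.tv_sec) +
	       (end.tv_usec - start.tv_usec) / 1e6;
	printf("%.2f MB/sec\n",
	       (double)BUFSIZE * ITERS / (1024 * 1024) / secs);
	return 0;
}

Run that back to back under two kernels on the same hardware and the
difference you see is exactly the cache/TLB effect I am talking about.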

Again, as Larry and I always say: who are the clocks, cache lines, and
TLB entries for, the user or the kernel?  Even for your job set it
makes a difference, in ways that are _not_ measurable as "5% system
time".  That number is partially bullshit in _real_ terms, because the
calculation does not take point 'b' above into consideration.

As a case in point, I have a researcher here who just crunches data
sets all day long; specifically, he does fluid analysis.  All his jobs
are raw crunch jobs in Fortran.  I did an experiment with him a few
weeks ago.  We built one single binary under SunOS for the experiment,
since SparcLinux has the compatibility code to run it as well.  His
job runs for 4 days on average with the data sets he currently uses.
We ran the exact same binary with the exact same data set under SunOS,
Solaris, and SparcLinux.  We isolated the machine from any network
activity at all, and we killed off all daemons on the machine for all
three OSes before beginning the jobs.  We basically gave it a single
user environment.

Guess what: the job ran 5 hours faster under SparcLinux than under
Solaris 2.5.1, and about 5 1/2 hours faster than under SunOS 4.1.4.
He needs to get as many jobs like this done every week as he can, so I
think the things I have presented matter to him.  He's told all of the
ARPA speech processing people upstairs, and they are all asking me
when they can get cycles on my SparcLinux machine so they can be more
sure of meeting their deadlines.  Thank you very much.
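
Do the arithmetic on that win, using nothing but the numbers above:

	4 days of wall clock time       ~  96 hours
	5 hours saved out of ~96 hours  ~  5% of the entire run

That is roughly your whole "5% system time" figure handed back to him,
on a job that barely asks the kernel for anything.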

   > Every cycle extra your system takes to start and eventually
   > finish this swap or disk access means cycles your large data set
   > process sits in wait.

   You don't get it, do you.

   If your machine spends 5% of its time in kernelland, the MAXIMUM
   you can get, even if by some miracle you eliminated ALL time the
   kernel spent, would be 5%. That's peanuts.

See above about how that 5% number isn't measuring many critical
things.

   The time is all spent waiting for I/O to complete, and assuming you
   are getting the disk to spit out blocks as fast as it possibly can,
   or within a few percent of there, you are done. No amount of bumming
   the kernel will ever get you more than another percent or two,
   period.

See above about TLB and cache misses due to general kernel bloat.

   > Don't drop latency from this example.  Once you even do get the
   > data off the disk, however fast, how quickly can you get the
   > context and the data itself back to the task that asked for it in
   > the first place?

   If you are maxing out the disk already, and your machine's pay is
   for an I/O bound program like a tickerplant, you won't notice a
   minor change in latency.

Yes, you will.  Think about intelligent disks using tagged command
queueing, where the more requests you can keep queued up to the drive,
the better chance you have of hitting peak performance.  And your
latency between "request complete" and "new request submitted"
drastically affects your ability to even reach that state.
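
To make that concrete, here is a minimal userland sketch of keeping a
drive's tag queue full.  It assumes a POSIX.1b aio implementation is
available; the device name, queue depth, and block size are made-up
values for illustration only.

/*
 * Keep QUEUE_DEPTH reads outstanding against a raw disk so an
 * intelligent drive always has commands sitting in its tag queue.
 * /dev/rsd0c, QUEUE_DEPTH, and BLOCKSIZE are illustrative, not real.
 */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QUEUE_DEPTH 8
#define BLOCKSIZE   (64 * 1024)

int main(void)
{
	struct aiocb cb[QUEUE_DEPTH];
	const struct aiocb *list[QUEUE_DEPTH];
	char *buf[QUEUE_DEPTH];
	off_t next_offset = 0;
	int fd, i;

	fd = open("/dev/rsd0c", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Prime the queue: QUEUE_DEPTH requests in flight at once. */
	for (i = 0; i < QUEUE_DEPTH; i++) {
		buf[i] = malloc(BLOCKSIZE);
		if (buf[i] == NULL)
			return 1;
		memset(&cb[i], 0, sizeof(cb[i]));
		cb[i].aio_fildes = fd;
		cb[i].aio_buf    = buf[i];
		cb[i].aio_nbytes = BLOCKSIZE;
		cb[i].aio_offset = next_offset;
		next_offset += BLOCKSIZE;
		if (aio_read(&cb[i]) < 0) {
			perror("aio_read");
			return 1;
		}
		list[i] = &cb[i];
	}

	/*
	 * As each request finishes, resubmit immediately so the
	 * drive's tag queue never drains while we think about it.
	 */
	for (;;) {
		aio_suspend(list, QUEUE_DEPTH, NULL);
		for (i = 0; i < QUEUE_DEPTH; i++) {
			if (aio_error(&cb[i]) == EINPROGRESS)
				continue;
			if (aio_return(&cb[i]) <= 0)
				return 0;	/* EOF or error, stop */
			/* ... crunch buf[i] here ... */
			cb[i].aio_offset = next_offset;
			next_offset += BLOCKSIZE;
			if (aio_read(&cb[i]) < 0) {
				perror("aio_read");
				return 1;
			}
		}
	}
}

Every extra cycle the kernel burns between the completion interrupt
and aio_read() getting the next command out to the drive is a cycle
the tag queue spends shorter than it has to be.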

   Now, as it turns out, I *do* use machine dependent hand coded
   string and memory copy functions on the machines I use a lot,

Here is your very own gold star.

   but that isn't where a big win is.

All in favor say "yes"; I personally say "no".  And I have the numbers
to prove it.

   Anyway, as I've said, there are serious limits to what is worth
   doing.

People from some sects laugh at me when I tell them that major device
driver interrupt routines should run right out of the trap table in
raw assembly.  Just the other day Matthew Jacob, one of the original
authors of the ESP SCSI driver on the Sparcs, brought up the exact
same idea to eliminate the "SCSI protocol overhead".  I gained a
million miles of respect for that man at that very instant.

   For networking code, there are significant wins to be had with some
   simple algorithmic changes. However, overall, as I've noted, if the
   kernel is chewing only a tiny percentage of your overall time on
   your system, you aren't going to notice if you optimize it to
   death, because even if you optimized it away entirely you wouldn't
   notice.

Tell that to the cache and the TLB; they will laugh at you for hours.

   Most modern machines spend most of their time IDLE.

Bwahaha, many people would strongly beg to differ with you.

   What people are looking for is the right environment with the right
   features, not every drop of CPU.

Again, many would beg to differ with you.

   Look at Sun Microsystems.

Yes, look at them, and at Larry's example of how they nearly took one
of the largest losses ever because their kernel was a bloated beast
and slow as molasses.  It matters in the real world, whether you would
like to entertain such an idea or not.

   I don't know what you do for a living,

I change the world.

   but I've been a lot of places where mission critical applications
   on Unix boxes were the firm's life blood. Not once did I ever hear
   anyone say "gee, HP's boxes have lower interrupt latency" or any
   !@#% like that.

Tell that to the millions of ISPs out there, or to the data mining
centers where an NFS server must be able to keep up with a CRAY over 2
kilometers of fiber.  People doing next generation visualization and
simulation over NFS with reality engines, and getting real time
results, might enlighten you as to what the "rest of the" real world
is doing and what is important to them.  Why do you think Sun is
bragging as of late that they have Solaris running web servers better?
Why are they investing so much in this Java thing?  The list goes on
and on...

David S. Miller
davem@caip.rutgers.edu