Subject: Re: But why?
To: David S. Miller <davem@caip.rutgers.edu>
From: Perry E. Metzger <perry@piermont.com>
List: tech-kern
Date: 10/23/1996 23:50:47
"David S. Miller" writes:
> Even for your purely compute/IO bound jobs you are leaving away two
> important issues (at least to me):
> 
> 	a) Consider that your OS is taking say 200 cycles more than it
> 	   needs to with just trap instruction in/out overhead.  Say
> 	   also that once your application takes control again, it
> 	   will take it only 100 instructions to get to the point
> 	   where it asks the kernel to queue up another disk request.
> 	   I'd say you are losing here.

Why?

If the disk I/O wait takes 95% of the time, who cares about whether
the tiny fraction remaining is in user or kernel?

As I said, if your kernel is only taking up 5% of your CPU, the best
you can do EVEN IF YOU ELIMINATED EVERY INSTRUCTION is get 5% faster.

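To put numbers on it, here is a back-of-the-envelope sketch in C; the
5% figure is purely illustrative:

    /* Best-case speedup from shaving kernel time, a la Amdahl.
     * The numbers are made up for illustration. */
    #include <stdio.h>

    int main(void)
    {
        double kernel_frac = 0.05; /* share of wall time in the kernel  */
        double eliminated  = 1.0;  /* share of that kernel time removed */

        /* Even eliminating every kernel instruction (eliminated = 1.0)
         * shrinks total run time by only kernel_frac. */
        double new_time = 1.0 - kernel_frac * eliminated;
        printf("speedup: %.3fx (%.1f%% faster)\n",
               1.0 / new_time, (1.0 / new_time - 1.0) * 100.0);
        return 0;
    }

That prints a speedup of about 1.05x. Bum the kernel down to nothing
and the job still finishes only about 5% sooner.
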
> Again, like Larry and I always say, who are the clocks/cache-lines/tlb
> for, the user or the kernel?

If the kernel isn't taking up a significant amount of CPU, what do you
care?

> Even for your job set it does make a difference, in ways that are
> _not_ measurable as "5% system time", that number is partial
> bullshit in _real_ terms, because that calculation does not take
> into consideration point 'b' I have just mentioned.

How's that?

>    The time is all spent waiting for I/O to complete, and assuming you
>    are getting the disk to spit out blocks as fast as it possibly can,
>    or within a few percent of there, you are done. No amount of bumming
>    the kernel will ever get you more than another percent or two,
>    period.
> 
> See above about tlb and cache misses due to general kernel bloat.

How is not getting a cache miss going to make the disk spin faster?

>    If you are maxing out the disk already, and your machine earns its
>    pay running an I/O bound program like a tickerplant, you won't notice a
>    minor change in latency.
> 
> Yes, you will.  Think about intelligent disks using TAG queueing,

I'm already thinking of them. If you are maxing out the disk, you are
maxing out the disk.

>    Anyway, as I've said, there are serious limits to what is worth
>    doing.
> 
> People from some sects laugh at me when I tell them that major device
> driver interrupt routines should run right out the trap table in raw
> assembly.  Just the other day Matthew Jacob, one of the original
> authors of the ESP scsi driver on the Sparcs, brought up the same exact
> idea to eliminate the "SCSI protocol overhead".  I gained a million
> miles of respect for that man in that very instance.

If your disk is already spinning at maximum velocity, and the bits are
emerging from the controller at the maximum rate that the manufacturer
says the thing is capable of, what exactly is recoding the SCSI driver
in assembler going to do for you, other than making the driver machine
dependent?

NetBSD shares a large fraction of its drivers between different
architectures. Many drivers run *unchanged* between Alphas and i386
boxes. This is a *major* win. Recode in assembler to get the extra 1%
of performance the user might notice, and you aren't going to be able
to port to a new machine really fast any more.

>    For networking code, there are significant wins to be had with some
>    simple algorithmic changes. However, overall, as I've noted, if the
>    kernel is chewing only a tiny percentage of your overall time on
>    your system, you aren't going to notice if you optimize it to
>    death, because even if you optimized it away entirely you wouldn't
>    notice.
> 
> Tell the cache and the tlb that, they will laugh at you for hours.

You seem to think if you keep saying this mantra over and over again
it will mean something.

>    Most modern machines spend most of their time IDLE.
> 
> Bwahaha, many people would beg to differ with you highly.

It's true. Go do a metric on the things.

>    Look at Sun Microsystems.
> 
> Yes, look at them, and larry's example of how they nearly took one of
> the largest losses ever because their kernel was a bloated beast and
> slow as molasses.

Solaris sucks. 4.1.X was not great, but people liked its environment
better than that of much faster machines.

>    I don't know what you do for a living,
> 
> I change the world.

Mighty nice of you.

I consult to companies that go belly up if their machines don't perform.

>    but I've been a lot of places where mission critical applications
>    on Unix boxes were the firm's life blood. Not once did I ever hear
>    anyone say "gee, HP's boxes have lower interrupt latency" or any
>    !@#% like that.
> 
> Tell that to the millions of ISP's out there, data mining centers
> where an NFS server must be able to keep up with a CRAY over 2
> kilometers of fiber.

If you are using NFS in a high-performance application, you're a
fool. (I've told this to several ISPs, by the way.) Algorithm problems
again, you know. (BTW, the number of kilometers of fiber doesn't
appreciably change the problem.)  Switching to AFS, or a dedicated
protocol to feed data in and out of your Cray, would make vastly more
sense than using NFS. Most Crays get fed by amazing disk and tape
arrays, by the way -- not NFS.

> People doing next generation visualization and simulation over NFS
> using reality engines and getting real time,

NFS isn't built for this. The writeback garbage you have to deal with
is horrendous. Algorithm problems -- not something bumming code will fix.

> Why do you think Sun is bragging as of late that they have Solaris
> running web servers better,

Because folks there did things like switching the incoming connection
queue to use a hash and building a nice data structure to handle
thousands of half-open connections? No assembler bumming in there, by
the by.

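To be concrete about the kind of change I mean, here is a sketch of the
idea in C -- not Sun's actual code, and every name in it is made up:

    /* Hash half-open (SYN_RCVD) connections by their address/port
     * 4-tuple instead of walking one long linear list. */
    #include <stddef.h>
    #include <stdint.h>

    #define SYNQ_BUCKETS 1024            /* power of two */

    struct half_open {
        uint32_t laddr, faddr;           /* local/foreign address */
        uint16_t lport, fport;           /* local/foreign port   */
        struct half_open *next;          /* bucket chain         */
    };

    static struct half_open *synq[SYNQ_BUCKETS];

    static unsigned
    synq_hash(uint32_t la, uint32_t fa, uint16_t lp, uint16_t fp)
    {
        uint32_t h = la ^ fa ^ ((uint32_t)lp << 16 | fp);
        h ^= h >> 16;
        return h & (SYNQ_BUCKETS - 1);
    }

    /* O(1) average lookup when the handshake ACK arrives, instead of
     * scanning thousands of embryonic connections one at a time. */
    struct half_open *
    synq_lookup(uint32_t la, uint32_t fa, uint16_t lp, uint16_t fp)
    {
        struct half_open *p = synq[synq_hash(la, fa, lp, fp)];
        for (; p != NULL; p = p->next)
            if (p->laddr == la && p->faddr == fa &&
                p->lport == lp && p->fport == fp)
                return p;
        return NULL;
    }

    void
    synq_insert(struct half_open *p)
    {
        unsigned b = synq_hash(p->laddr, p->faddr, p->lport, p->fport);
        p->next = synq[b];
        synq[b] = p;
    }

The win is algorithmic -- a constant-time lookup instead of a linear
scan -- and it looks exactly the same whether you write it in C or in
hand-rolled assembler.
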
> why should they invest so much in this java thing,

Which is slow -- the neat thing about it isn't the hand-tuned machine
codeness of it, if you didn't notice.

> the list goes on and on...

I think I've made my point.

Perry