Subject: Re: But why?
To: David S. Miller <davem@caip.rutgers.edu>
From: Tim Newsham <newsham@aloha.net>
List: tech-kern
Date: 10/23/1996 20:09:31
>    The time it takes to move the block off the disk dominates. The
>    microscopic amount of time the machine spends in kernelland
>    doesn't.
> 
> Even for your purely compute/IO bound jobs you are leaving away two
> important issues (at least to me):
> 
> 	a) Consider that your OS is taking say 200 cycles more than it
> 	   needs to with just trap instruction in/out overhead.  Say
> 	   also that once your application takes control again, it
> 	   will take it only 100 instructions to get to the point
> 	   where it asks the kernel to queue up another disk request.
> 	   I'd say you are losing here.

ok.  say I'm reading 256k at a time and perhaps
getting 3.5Mbytes/sec out of my filesystem.  So
a 256k read takes oh, roughly in the ball park of
71,428,571 nanoseconds to complete.  Now say I have
a 10nsec cycle time.  Now I take an OS that has
a "broken" syscall path and I fix it, saving 200 cycles
on a syscall.  So now my program can do a 256k read in

   71,428,571 - (200 * 10) = 71,426,571 nsec

cool.  I got a speedup of:

   71428571 / 71426571 = 1.000028

not bad for the many hours it took me.

Now what if on the other hand I found that I was doing
something stupid in my buffering of the filesystem blocks
and I took the time to fix that instead and found
out that it gave me a 0.1% speedup for read operations.
Now I'm down to:

   71428571 * (1 - 0.001) = 71357142

hmm.. looks like that was probably a better use
of my time if I do a lot of reads.
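The two back-of-the-envelope calculations above can be reproduced in a
few lines.  A minimal sketch (Python purely for illustration; the
numbers assume a 256k read at ~3.5 Mbytes/sec and a 10 nsec cycle
time, as stated):

```python
# Back-of-the-envelope numbers from the discussion above.
read_ns = 71_428_571     # one 256k read at ~3.5 Mbytes/sec, in nanoseconds
cycle_ns = 10            # 10 nsec cycle time (100 MHz)
saved_cycles = 200       # cycles shaved off the syscall path

# Option 1: fix the "broken" syscall path.
fast_syscall_ns = read_ns - saved_cycles * cycle_ns   # 71,426,571 nsec
speedup_syscall = read_ns / fast_syscall_ns           # ~1.000028

# Option 2: fix the buffering bug instead, a 0.1% win on reads.
fast_buffer_ns = read_ns * (1 - 0.001)                # ~71,357,142 nsec
speedup_buffer = read_ns / fast_buffer_ns             # ~1.001

print(f"syscall fix:   {speedup_syscall:.6f}x")
print(f"buffering fix: {speedup_buffer:.6f}x")
```

The buffering fix wins by a factor of roughly 35, even though 0.1%
sounds unimpressive on its own.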
  
Looking at the time difference the user of the
application actually sees, most people would agree
that both were a waste of time.

> 	b) If your OS is bloated in its interrupt path, every single
> 	   clock tick your application is losing real cache lines and
> 	   real TLB entries.  This is a fact and I have seen it in
> 	   action.  A userland memcpy() was faster under SparcLinux
> 	   by a significant and noticeable amount than both Solaris
> 	   and SunOS because the Linux kernel had much better
> 	   cache/TLB patterns and footprint.  This is not ever using
> 	   kernel services, just like you mention for your
> 	   applications.

I can't tell if you're arguing for your improved system-call
path or not here.  If you are, can you say how much your
cache/TLB patterns and footprint improved when you started
tweaking the syscall path?  If that is a hard thing to
measure, can you say how much faster things like memcpy()
got from just changing the kernel interrupt/syscall path?

If it is true that Linux is nicer about using the cache and
TLB (and I believe you when you say it is), it could be
for reasons completely unrelated to the syscall path.

> Again, like Larry and I always say, who are the clocks/cache-lines/tlb
> for, the user or the kernel?  Even for your job set it does make a
> difference, in ways that are _not_ measurable as "5% system time",
> that number is partial bullshit in _real_ terms, because that
> calculation does not take into consideration point 'b' I have just
> mentioned.

do you have measurements of stalls or CPI seen in userland
before and after some of your changes?

> As case in point, I have a researcher here who just crunches data sets
> all day long, specifically he does fluid analysis.  All his jobs are
> raw crunch jobs in fortran.  I did an experiment with him a few weeks
> ago.  We built a binary under SunOS, one single binary for the
> experiment since SparcLinux has the compatibility code to run it as
> well.  His job runs for 4 days on average with the data sets he
> currently utilizes.  We ran the same exact binary with the same exact
> data set under SunOS, Solaris, and SparcLinux.  We isolated the
> machine from any network activity at all, we also killed off all
> daemons on the machine for all three OS's before beginning the jobs.
> We basically gave it a single user environment.
> 
> Guess what, the thing ran 5 hours faster under SparcLinux compared to
> Solaris2.5.1, about 5 1/2 hours faster than SunOS4.1.4.  Let's say he
> needs to get as many jobs like this as he can done every week, I think
> it matters to him the things I have presented.   He's told all of the
> ARPA speech processing people upstairs, and they are all asking me
> when they can get cycles on my SparcLinux machine so they can be more
> sure to meet their deadlines.  Thank you very much.

That's great, but where is this time coming from?  How
much of it is from your optimized system-call path and
how much from other differences in the kernels?
What I'm getting at is that observing that OS A, with a
fast syscall path, is faster than an unrelated OS B
without one doesn't really prove anything about the fast
syscall path.  It could very well be that 100% of the
speedup is due to your improvements.  On the other
hand, it could be that none of it is.  I would like to
see comparisons of OS A without the fast path against
OS A with the fast path.
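A comparison like that could start with a microbenchmark run on the
same kernel before and after the change.  A minimal sketch (an
illustration, not a rigorous benchmark; it assumes os.getpid()
actually traps into the kernel on each call, which some libc versions
have avoided by caching the result in userland):

```python
# Estimate the round-trip cost of a cheap syscall by timing a tight
# loop of getpid() calls.  Run this on the same OS before and after a
# syscall-path change to isolate the effect of that change alone.
import os
import time

def syscall_round_trip_ns(iters: int = 1_000_000) -> float:
    """Average wall-clock cost of one getpid() round trip, in nsec."""
    start = time.perf_counter_ns()
    for _ in range(iters):
        os.getpid()
    elapsed = time.perf_counter_ns() - start
    return elapsed / iters

if __name__ == "__main__":
    print(f"~{syscall_round_trip_ns():.0f} nsec per syscall round trip")
```

The difference between the two runs, times the number of syscalls a
real workload makes, is the actual time the fast path buys.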

> David S. Miller
> davem@caip.rutgers.edu

I know that hacking code is much more exciting
than making measurements.  I can understand why
you may not have the measurements everyone wants to see,
but if you want to win over the world (and it appears
you do) it would really help.  If you want to make
your machine faster (and it appears you do) it's
nice to know where the time is going and how much
you're helping.  You may find that there are better
ways to spend your time, or you may persuade
the world that they should be spending their
time looking at your solutions.

                             Tim N.