Subject: Re: But why?
To: David S. Miller <davem@caip.rutgers.edu>
From: John S. Dyson <toor@dyson.iquest.net>
List: tech-kern
Date: 10/24/1996 01:21:23
> 
> One of the points I was trying to make is that, in certain cases I can
> document pretty well, these "micro optimizations" can do quite a bit.
> 
Yep, some of the things I am experimenting with for FreeBSD, for example,
will improve performance significantly more than shaving what amounts to
a single subroutine call's worth of overhead out of the syscall path.
There is a lot more going on deep in the bowels of the kernel, many more
calls, etc.  That is where I am working (and have been all along.)

I have the most fun changing O(n^2) algorithms (or worse) into linear
or near-linear ones.  IMO, that is much of what makes systems run better.
Of course, reading/writing one byte at a time is pretty much one of the
first things that a beginning programmer learns NOT to do...  Even on
Linux, I dare say that a 1k write is faster than a 1-byte write.
Sometimes there is no way out of the single-byte write, though, and that
is where the very low latencies can be useful...
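
To put some numbers behind the write-size point, here is a minimal
sketch (not code from anyone's tree, just an illustration of the two
styles); the only real difference is how many times you cross into
the kernel:

#include <sys/types.h>
#include <unistd.h>

/*
 * Copy "len" bytes from "buf" to "fd" one byte at a time: one
 * system call (one full trip into the kernel) per byte.
 */
ssize_t
copy_bytewise(int fd, const char *buf, size_t len)
{
        size_t i;

        for (i = 0; i < len; i++)
                if (write(fd, buf + i, 1) != 1)
                        return (-1);
        return (len);
}

/*
 * Push the same data out in 1k chunks: roughly len/1024 system
 * calls instead of len, so the per-call overhead nearly vanishes.
 */
ssize_t
copy_chunked(int fd, const char *buf, size_t len)
{
        size_t off, chunk;
        ssize_t n;

        for (off = 0; off < len; off += n) {
                chunk = (len - off > 1024) ? 1024 : len - off;
                if ((n = write(fd, buf + off, chunk)) <= 0)
                        return (-1);
        }
        return (off);
}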

I tend to agree with the general statements from the Linux people that
imply (though I am not sure they have said it directly) that cache
footprint is important, and FreeBSD has improvements in that area in
both user space and the kernel, at the micro-level and at the higher
levels.  Many of those changes are indirectly applicable to NetBSD, and
when NetBSD gains more of a performance focus, they can adopt them
quickly.  Non-optimal layout of a data structure can easily and quickly
blow away any difference in a system call convention.
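
Here is a contrived sketch of what I mean by layout (these are not
actual kernel structures): a scan that only tests a flag word drags
the cold bookkeeping through the cache when the two are interleaved,
while splitting the hot field out into its own dense array lets one
cache line serve many objects.

#define NOBJ    1024

/* Interleaved layout: the one hot field rides with cold data. */
struct fatobj {
        int     flags;                  /* the only field the scan reads */
        char    bookkeeping[60];        /* cold data sharing its cache line */
};

struct fatobj   objs[NOBJ];             /* one cache line fetched per object */

/* Split layout: hot field packed densely, cold data elsewhere. */
int     obj_flags[NOBJ];                /* a 32-byte line covers 8 objects */
char    obj_bookkeeping[NOBJ][60];

int
count_active(void)
{
        int i, n;

        for (n = i = 0; i < NOBJ; i++)  /* dense scan: far fewer misses */
                if (obj_flags[i] & 0x1)
                        n++;
        return (n);
}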

One of the very tricky things is managing data that is usually
represented as a long list -- imagine the processor cache footprint of
walking long lists!!!  We have been working to minimize the cache
footprint of manipulating these data structures.  The interesting thing
is that doing so is counter-intuitive, and lmbench results can look
worse, yet real (large) systems run much faster.  I think people
naturally think in terms of the static sizes of data structures, and
not necessarily in terms of how they are accessed.
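
As a toy sketch of the long-list problem (again, not FreeBSD code): a
lookup in one long list walks half of it on average, and nearly every
node visited is a likely cache miss.  Hashing into many short lists
does not shrink the static data at all -- it adds a table -- but each
lookup now touches a handful of cache lines instead of hundreds.  The
static size got bigger; the access pattern got cheaper.

#include <stddef.h>

#define NHASH           256
#define HASH(id)        ((id) & (NHASH - 1))

struct node {
        int             id;
        struct node     *next;
};

struct node     *all;                   /* one long list: big footprint per lookup */
struct node     *hashtab[NHASH];        /* many short lists: small footprint */

struct node *
lookup_list(int id)
{
        struct node *n;

        for (n = all; n != NULL; n = n->next)   /* O(n) pointer chasing */
                if (n->id == id)
                        return (n);
        return (NULL);
}

struct node *
lookup_hash(int id)
{
        struct node *n;

        for (n = hashtab[HASH(id)]; n != NULL; n = n->next)     /* short chain */
                if (n->id == id)
                        return (n);
        return (NULL);
}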

Recently, I made a change that made a certain lmbench benchmark run 10% slower,
but actually sped up the system by 2-3X, under certain heavy load conditions.
Those heavy load conditions happen regularly on various types of internet
servers (but are unlikely to happen on workstations.)  Additional cache layout
improvements have regained the 10% and perhaps even more.

A simple, ad-hoc benchmark that I use to compare systems, for example,
is to run multiple copies of heapsort or nsieve from aburto's suite of
very simple benchmarks.  Individually, they are certainly not fine
examples of scaling benchmarks; run lots of them at once, though, and
the results become much more interesting.
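
Something along these lines works as the driver (the "./heapsort" path
is just a placeholder for wherever your copy of the binary lives):

#include <sys/types.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        struct timeval start, end;
        int i, ncopies;

        ncopies = (argc > 1) ? atoi(argv[1]) : 8;

        gettimeofday(&start, NULL);
        for (i = 0; i < ncopies; i++) {
                if (fork() == 0) {
                        execl("./heapsort", "heapsort", (char *)NULL);
                        _exit(1);               /* exec failed */
                }
        }
        while (wait(NULL) > 0)                  /* collect all the children */
                ;
        gettimeofday(&end, NULL);

        printf("%d copies: %.2f seconds wall clock\n", ncopies,
            (end.tv_sec - start.tv_sec) +
            (end.tv_usec - start.tv_usec) / 1e6);
        return (0);
}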

Tools such as lmbench, some home-brew code (I finally found some of
mine), and INTELLIGENTLY interpreting the results are how I work on
improving FreeBSD...  I would definitely say that even though I don't
ignore the additional overhead of the system calls, it isn't where
there is a lot of time to be mined.  Scalability of algorithms and
cache footprints are where the time is.

John