current-users: Re: Multiprocessor with NetBSD ?

Subject: Re: Multiprocessor with NetBSD ?
To: NetBSD-current <current-users@netbsd.org>
From: mike stone <bsdusr@yawp.com>
List: current-users
Date: 06/05/2001 17:18:18
>> .   for tightly-linked
>> MP systems, sharing the memory bus is an inescapable serial dependency,
>> so you're pretty much stuck with sub-linear speedup.
>
> Ahh but are you not forgetting that there is no longer a linear
> connection between the processor and the main memory?  There are at
> least a couple of levels of cache in there that reduce the dependency
> on going to main memory.

nope, not forgetting it.. caches don't eliminate the Von Neumann
bottleneck or Amdahl's Law, they just boost the effective speed of a
machine's memory bus by cutting out latency.

in general, there are two kinds of caches:  a small, per-core cache
right next to each ALU in a CPU, and an L2 cache that sits between
the CPU and RAM.   bandwidth between an ALU and its private cache
can be blisteringly fast because both cache and bus are doped right
into the die.   an L2 cache is slower than the per-core cache, because
it's on a separate chip and the bus is longer, but it can still be
a heck of a lot faster than main RAM.

caches reduce latency by preloading data from slower repositories.
the idea is to suck enough data into the cache that the CPU can
chew through one load while the next (comparatively slow) cache
update makes its way along the bus.   if you're doing things right,
you can store terabytes of information on really slow media, like
DAT, then preload chunks of that to something faster, like a hard
drive.   then you can preload data from the hard drive into main RAM,
preload data from main RAM into the L2 cache, preload  data from the
L2 cache into the per-core caches, and keep your ALUs cooking along
at their maximum rate all the way from one end of a tape to the other.

all that does is simulate a blisteringly fast overall memory bus by
parking data on slower media while it's not being used, though.   the
Von Neumann bottleneck is still there because you can't process data
faster than the maximum rate at which data travels between CPU and
memory.   it's a theoretical limit, not an engineering problem.
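
to make the latency argument concrete, here's a minimal python sketch of
average access time through a cache hierarchy.   the function name, hit
rates and latencies are all assumptions picked for illustration, not
measurements of any real machine:

```python
def average_access_ns(levels, backing_ns):
    """Average time per memory access, in nanoseconds.

    levels:     list of (hit_rate, latency_ns) pairs, fastest cache first
    backing_ns: latency of the slowest store (main RAM in this example)
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get this far down the hierarchy
    for hit_rate, latency_ns in levels:
        total += reach * latency_ns  # every access reaching this level pays its latency
        reach *= (1.0 - hit_rate)    # only the misses fall through to the next level
    return total + reach * backing_ns


# assumed numbers: per-core cache hits 95% at 1 ns, L2 hits 80% at 10 ns, RAM 100 ns
cached = average_access_ns([(0.95, 1.0), (0.80, 10.0)], 100.0)
uncached = average_access_ns([], 100.0)
print(f"with caches: {cached:.2f} ns, without: {uncached:.2f} ns")
```

with those made-up numbers the caches make memory look about forty times
faster, but the 100 ns trip to RAM is still in the sum.   it just gets
paid less often.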

the serialization penalties of Amdahl's Law are still there, too,
because all the cores have to keep their caches in sync.   if
multiple cores need to use the same data, and even one of those cores
is planning to change that data, all those cores have to agree on a
single authoritative copy.   if they don't, you get a train wreck.
either two cores with write access will try to write new values to
the same location, or a core will read a stale value from its own
cache after the authoritative version has changed.
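
for a feel of what that costs, here's a minimal sketch of Amdahl's Law
in python.   the 10% serial fraction is an assumption chosen for
illustration, and amdahl_speedup is just a name i made up for the
textbook formula:

```python
def amdahl_speedup(serial_fraction, n_cpus):
    """Best-case speedup when serial_fraction of the work cannot run in parallel."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cpus)


# assumed workload: 10% of the run time is stuck behind serial dependencies
for n in (1, 2, 4, 16, 64):
    print(f"{n:3d} CPUs -> {amdahl_speedup(0.10, n):.2f}x")
```

with a 10% serial fraction the speedup can never reach 10x, however many
CPUs you add, because the serial part alone takes a tenth of the original
run time.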

so.. a shared-everything MP system can work no faster than its
maximum bus speed, no matter how many cores or CPUs it has.   and
while caching *can* boost your maximum bus speed dramatically, you
still run into serial limits when two cores have to work with the
same memory location.
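
a rough sketch of that ceiling, in python.   the model is deliberately
crude and everything in it is an assumption: each CPU wants a fixed
data rate, only cache misses cross the one shared bus, and the units
are arbitrary:

```python
def aggregate_throughput(n_cpus, per_cpu_demand, bus_bandwidth, miss_rate):
    """Total useful data rate for n_cpus sharing a single memory bus.

    Only the miss_rate fraction of accesses goes over the bus, so the bus
    can sustain at most bus_bandwidth / miss_rate of total CPU demand.
    """
    ceiling = bus_bandwidth / miss_rate
    return min(n_cpus * per_cpu_demand, ceiling)


# assumed: each CPU wants 4 units/s, the bus carries 1 unit/s, 5% of accesses miss
for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} CPUs -> {aggregate_throughput(n, 4.0, 1.0, 0.05):.1f} units/s")
```

in this toy model a better hit rate raises the ceiling, but once the
CPUs' combined demand reaches it, extra CPUs add nothing.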



> The other thing that is helping to linearise the curve is that
> larger MP systems do not have a single processor to memory bus.  They
> tend to have what is, in effect, a large crossbar switch between the
> processors and memory so access to memory is not shared between all
> the processors.

true, but that's just one more way to speed up your memory bus.   it's
also expensive.

ideally, each core would have its own bus to main RAM.   that means
having a switch with N cores on one side, and N buses to memory on the
other.   for N cores and N buses, the switch needs a crosspoint for every
core/bus pair, so its complexity grows as N^2.   you also have to devote
something like N^2 times as much space on your circuit board to bus
traces.
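
a quick python sketch of how fast that grows.   it assumes a plain
full crossbar with one crosspoint per core/bus pair, which is a
simplification, and crossbar_crosspoints is a hypothetical helper
name:

```python
def crossbar_crosspoints(n_cores, n_buses=None):
    """Crosspoints in a full crossbar: one per (core, bus) pair."""
    if n_buses is None:
        n_buses = n_cores  # the ideal case above: one memory bus per core
    return n_cores * n_buses


for n in (2, 4, 16, 64):
    print(f"{n:2d} cores -> {crossbar_crosspoints(n)} crosspoints")
```

going from 4 cores to 64 is 16 times the cores but 256 times the
crosspoints (16 versus 4096).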


my not-so-humble opinion is that we should leave SMP and all its scaling
problems to Linux, and work on making NetBSD a really good platform for
shared-nothing clusters.   that means a 'ps' command that can display
all the processes running on any machine in a network, and being able
to SIGHUP a process that's running on a different machine.

there's plenty of work for shared-nothing clusters.   imagine how cool
it would be to run a webserver farm if you could type 'apachectl start'
into the command-line of any machine, and boot httpds on every machine
in the network.

there are lots of interesting problems to be solved at that level.   and
while there may be some geekish cachet in being able to say your OS runs
on a 64-CPU Sparc box that one person in two million can afford, i
personally think an OS that can turn 2500 machines of varying age and
architecture into a RAIC that presents a uniform, single-system
appearance from any machine in the network would be *darned* useful.



mike