Subject: Re: Multi proc support
To: <>
From: David Laight <david@l8s.co.uk>
List: port-i386
Date: 02/06/2002 22:47:56
Thor Lancelot Simon wrote:
> 
> On Wed, Feb 06, 2002 at 09:35:48PM +0100, Jaromir Dolecek wrote:
> >
> > If the benchmarking involves only I/O, I imagine a system using just
> > one processor (and without processor interlocking) _might_ outperform
> > a system using all four CPUs.  It very much depends on usage pattern.
> 
> Certainly it's the case that a system using all four CPUs, with
> a multithreaded or otherwise concurrent kernel, will significantly
> outperform the same system using our current giant-lock kludge --
> which, as you point out, may not even perform as well as a single-CPU
> machine.

Indeed there are a whole lot of unexpected issues that cause SMP systems
to behave badly.  One issue that is often overlooked (by the
uninitiated) is the severe cost of cache snooping - when a cache line
you want is 'dirty' in a different cpu's cache.  Various things have to
be done to keep memory local to a single cpu (eg per-cpu freelists).
Indeed you only really get the benefits if the workload of the cpus is
kept as separate as possible.  For a big server system it is often
advisable to bind different processes to a cpu - or at least make it
unusual for the scheduler to move a process between cpus.
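
To make the per-cpu freelist idea concrete, here is a rough sketch -
not code from NetBSD or any of the systems mentioned; curcpu_id(),
obj_alloc_global() and NCPUS are invented names for the example:

	struct obj {
		struct obj	*next;
		/* ... payload ... */
	};

	#define	NCPUS		4
	#define	CACHE_LINE	64

	/*
	 * One list head per cpu, each padded to its own cache line so
	 * two cpus never write the same line.  The fast paths below
	 * touch only the local cpu's line: no lock, no snoop traffic.
	 */
	struct freelist {
		struct obj	*head;
		char		pad[CACHE_LINE - sizeof(struct obj *)];
	};

	static struct freelist pcpu_free[NCPUS];

	extern int curcpu_id(void);		/* invented helpers */
	extern struct obj *obj_alloc_global(void);

	struct obj *
	obj_alloc(void)
	{
		/* assumes preemption is disabled so we stay on this cpu */
		struct freelist *fl = &pcpu_free[curcpu_id()];
		struct obj *o = fl->head;

		if (o != NULL)
			fl->head = o->next;	/* fast path: local line only */
		else
			o = obj_alloc_global();	/* slow path: shared pool, locked */
		return o;
	}

	void
	obj_free(struct obj *o)
	{
		struct freelist *fl = &pcpu_free[curcpu_id()];

		o->next = fl->head;
		fl->head = o;
	}

The slow path still needs a conventional lock, but it is taken far less
often than the per-object operations.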

Locks become memory 'hot spots' even if they are rarely waited for.
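
Even an uncontended lock dirties the cache line it lives in on every
acquire/release, so the line ping-pongs between cpus; packing several
locks or counters into one line makes it worse.  A made-up illustration
of the padding trick (gcc syntax), not any particular kernel's code:

	#define	NCPUS		4
	#define	CACHE_LINE	64

	/*
	 * Bad: the per-cpu counters share one cache line, so updates
	 * from different cpus fight over it even though there is no
	 * logical contention at all ("false sharing").
	 */
	struct bad_stats {
		unsigned long	count[NCPUS];
	};

	/* Better: pad each cpu's counter out to its own cache line. */
	struct padded_count {
		unsigned long	count;
		char		pad[CACHE_LINE - sizeof(unsigned long)];
	} __attribute__((aligned(CACHE_LINE)));

	static struct padded_count stats[NCPUS];	/* one line per cpu */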
> 
> > AFAIK FreeBSD SMP is giant-lock-SMP at the moment too. It would
> > be interesting to find out how well NetBSD and/or FreeBSD SMP
> > performs compared to Linux in quad processor configuration.
> 
> For what it's worth, on my 6-CPU machine, I can build the system
> faster if I remove three processors, no matter how many build jobs
> I use, and I can build the system faster with *one* build job, if
> I leave all CPUs spun up, than with six.

With 1 big lock that doesn't surprise me - of course, you could run 5
RC5 clients at the same time.
> 
> The locking has quite a substantial cost.  However, where you really
> lose by running giant-lock on a multiprocessor is in the potential
> performance you *throw away*; running Solaris or UnixWare, if I
> pull three CPUs the NFSops/sec I can run on my machine do, in fact,
> drop by about 30%.  Remember, bulk read/write benchmarks are not
> a realistic predictor of actual I/O performance for most applications;
> there's a certain fixed cost to handling an NFS operation of *any*
> size, and with a highly concurrent kernel you can handle up to N of
> them at a time when you've got N processors; with a giant-lock
> kernel you can handle one of them at a time, period.

A lot of work went into the UnixWare SVR4MP code, including rewrites of
large chunks of the existing kernel.  It is designed to run server code,
so a single FTP transfer, file copy, or whatever isn't a good test.  The
aggregate rate for lots of concurrent transfers is what matters.
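
The one-op-at-a-time effect Thor describes falls straight out of where
the lock sits.  Purely illustrative pseudo-C, not anybody's actual NFS
code - the lock type and structures stand in for whatever the real
kernel provides:

	typedef struct { int locked; } kmutex_t;	/* stand-in lock */
	extern void mutex_enter(kmutex_t *);
	extern void mutex_exit(kmutex_t *);

	struct vnode  { kmutex_t v_lock; /* ... */ };
	struct nfsreq { struct vnode *vp; /* ... */ };

	static kmutex_t giant_lock;
	extern void nfs_do_op(struct nfsreq *);

	/*
	 * Giant-lock kernel: every operation serialises on the single
	 * kernel lock, so N cpus still complete one op at a time.
	 */
	void
	nfs_op_giant(struct nfsreq *r)
	{
		mutex_enter(&giant_lock);
		nfs_do_op(r);			/* every cpu queues here */
		mutex_exit(&giant_lock);
	}

	/*
	 * Fine-grained kernel: each op locks only the objects it
	 * touches, so ops on different files run in parallel and the
	 * aggregate rate scales towards N.
	 */
	void
	nfs_op_fine(struct nfsreq *r)
	{
		mutex_enter(&r->vp->v_lock);	/* per-file lock */
		nfs_do_op(r);
		mutex_exit(&r->vp->v_lock);
	}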

I have written multithreaded drivers - including ethernet drivers that
multithread the tx path, and an LLC2 (connection-mode protocol) driver
that could multithread multiple connections as well as most of the work
for a single connection.
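
For the ethernet tx path, 'multithreaded' roughly means the driver lock
covers only the descriptor ring, not the whole send, so two cpus can
each be preparing a packet at once.  Again only a sketch, with invented
names (dma_map(), fill_tx_desc(), hw_kick_tx()), not code from any of
the drivers above:

	typedef struct { int locked; } kmutex_t;	/* stand-in lock */
	extern void mutex_enter(kmutex_t *);
	extern void mutex_exit(kmutex_t *);

	struct pkt;				/* opaque packet buffer */

	struct softc {
		kmutex_t	tx_mutex;	/* covers only the tx ring */
		int		tx_free;	/* free descriptor count */
		int		tx_prod;	/* producer index */
	};

	/* Invented helpers standing in for the real driver routines. */
	extern unsigned long dma_map(struct pkt *);
	extern void dma_unmap(unsigned long);
	extern void fill_tx_desc(struct softc *, int, unsigned long);
	extern void hw_kick_tx(struct softc *);

	/*
	 * Many cpus may call this at once.  The expensive per-packet
	 * work (DMA mapping) is done outside the lock; only the ring
	 * update is serialised, so the lock is held for a handful of
	 * instructions rather than the whole transmit.
	 */
	int
	eth_start_xmit(struct softc *sc, struct pkt *p)
	{
		unsigned long pa = dma_map(p);

		mutex_enter(&sc->tx_mutex);
		if (sc->tx_free == 0) {
			mutex_exit(&sc->tx_mutex);
			dma_unmap(pa);
			return -1;		/* ring full; caller requeues */
		}
		sc->tx_free--;
		fill_tx_desc(sc, sc->tx_prod++, pa);
		hw_kick_tx(sc);			/* tell the chip */
		mutex_exit(&sc->tx_mutex);
		return 0;
	}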

	David