Subject: Re: scheduler woes on MPACPI kernel
To: None <current-users@netbsd.org>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: current-users
Date: 01/19/2005 00:17:39
On Tue, Jan 18, 2005 at 11:14:40PM +0100, Johnny Billquist wrote:
> On Tue, 18 Jan 2005, Frank van der Linden wrote:
> 
> >On Tue, Jan 18, 2005 at 10:53:26PM +0100, Johnny Billquist wrote:
> >>Huh? I didn't understand this...
> >>Are you claiming that a finer grained lock will decrease performance? Or
> >>is there a serious bug in the Linux kernel?
> >
> >For an HT system, finer grained locks can have the effect of increasing
> >the problem of virtual CPUs competing for e.g. cache and TLB entries. Thus
> >making the system slower.
> 
> Hmm. I haven't given much thought on how much impact shared cache and TLB 
> can have. Is it really that bad?

I think you still don't understand what hyperthreading is.  It's not just
"shared cache and TLB", it's basically shared *everything*.  *All* the
actual execution resources are shared; the processor just handles two
instruction streams at once and maintains two program counters.

Now, obviously, when you do this to a preexisting processor architecture,
you have to fake things up so that the single multithreaded CPU gives a
bit more of the appearance of being two distinct CPUs than just having
two program counters would -- otherwise you'd need massive OS and
application-level support.  But that, in essence, is what's going on.
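
For what it's worth, the two logical CPUs are visible through the
ordinary CPUID enumeration.  As a rough sketch -- plain user-level C
with gcc-style inline asm, just an illustration, not anything out of
our kernel -- something like this shows the HTT feature flag and the
logical-CPU count the package advertises:

	#include <stdio.h>

	/*
	 * Sketch: CPUID leaf 1 reports the HTT feature in EDX bit 28
	 * and the number of logical processors per physical package
	 * in EBX bits 23:16.  x86 only; illustration, not kernel code.
	 */
	static void
	cpuid(unsigned leaf, unsigned *a, unsigned *b, unsigned *c,
	    unsigned *d)
	{
		__asm__ __volatile__("cpuid"
		    : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
		    : "a"(leaf));
	}

	int
	main(void)
	{
		unsigned a, b, c, d;

		cpuid(1, &a, &b, &c, &d);
		if (d & (1U << 28))
			printf("HTT set, %u logical CPUs per package\n",
			    (b >> 16) & 0xff);
		else
			printf("no hyperthreading reported\n");
		return 0;
	}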

Hyperthreading is a latency-hiding trick, in much the same way that
out-of-order execution is.  For *some* workloads, notably certain
application-level floating-point-intensive workloads, hyperthreading
can be a small win even when there is already an out-of-order
execution engine working on the same problem.
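
To make the latency-hiding idea concrete, here is a toy user-level
sketch (pthreads; illustration only, not a real benchmark): one thread
is almost always stalled chasing pointers through a big randomly-linked
list, the other is pure ALU work with no memory traffic.  On an HT
pair the core can make progress on the second thread while the first
one waits on memory; on a single non-HT CPU the two simply time-slice.

	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>

	#define NODES	(1 << 22)	/* ~16MB of list, well past the caches */

	static int *next;		/* next[i] = index of the node after i */

	/* Memory-bound: nearly every load is a cache miss. */
	static void *
	chase(void *arg)
	{
		int i = 0, n;

		for (n = 0; n < NODES * 4; n++)
			i = next[i];
		return (void *)(long)i;
	}

	/* ALU-bound: no memory traffic at all. */
	static void *
	compute(void *arg)
	{
		unsigned long x = 1, n;

		for (n = 0; n < 100000000UL; n++)
			x = x * 2654435761UL + 1;
		return (void *)x;
	}

	int
	main(void)
	{
		pthread_t t1, t2;
		int i, j, t;

		next = malloc(NODES * sizeof(int));
		for (i = 0; i < NODES; i++)
			next[i] = i;
		/* Sattolo's algorithm: one random cycle, defeats prefetch. */
		for (i = NODES - 1; i > 0; i--) {
			j = rand() % i;
			t = next[i]; next[i] = next[j]; next[j] = t;
		}

		pthread_create(&t1, NULL, chase, NULL);
		pthread_create(&t2, NULL, compute, NULL);
		pthread_join(t1, NULL);
		pthread_join(t2, NULL);
		printf("done\n");
		return 0;
	}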

But for most workloads, hyperthreading is not a win.  It is
particularly unsurprising that it does not speed up a system build:
recent benchmarks I've run while tuning the Project's own
continuous-build system indicate that the system build actually has
rather good locality of reference, and that it is not nearly as
demanding of the memory subsystem -- even in terms of latency -- as we
might previously have thought.  There is more than enough
latency-hiding in the out-of-order engines of most modern CPUs to keep
things chugging along nicely.

Which leaves you with no win from hyperthreading -- but all the "lose"
that you get from the multiprocessor locking, TLB contention, and so
forth of a real multiprocessor system.

It is not surprising that it is slower.

Thor