Re: Making run queues independent of the pluggable scheduler

Andrew Doran wrote:
| Hi,
| The diff below extracts the per-CPU run queue code from the M2 scheduler and
| makes it non-optional, removing the 4BSD scheduler's global run queue. With
| the patch, it means that the pluggable scheduler is responsible only for
| adjusting the priority of timeshared jobs.
| Reasons for doing this:
| - 4BSD gains processor sets/affinity, although I haven't tested that yet.
| - 4BSD gets a huge performance boost on producer/consumer workloads like
|   sysbench OLTP.
| - We have less code to maintain.
| There are a couple of other changes:
| - It makes sched_enqueue responsible for causing a preemption if needed.
|   Previously this was left up to the caller and was only done at one site
|   (sleepq_remove).
| - It changes the CPU selection algorithm slightly. Weak affinity is not
|   considered until the job has context switched a preset number of times,
|   currently 5. This is to try and better distribute jobs among the CPUs. It
|   uses the new call idle_pick to find an idle CPU if possible. If no idle
|   CPUs, it does a circular scan of CPUs instead of always starting at the
|   first CPU. That's to try and ensure that we don't unfairly overload one
|   CPU. I will make the CPU selection changes a separate commit if they
|   have been demonstrated to be worthwhile.
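
For readers following the thread, the CPU selection steps quoted above can
be sketched roughly as below. This is a minimal userland illustration and
not the code from the diff: the struct names, NCPU, idle_pick_sketch and
select_cpu_sketch are all hypothetical, the ">= 5" comparison is a guess at
how the threshold is applied, and "weak affinity" is simplified to "stay on
the last CPU". The scan is assumed to look for the shortest run queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCPU 4                    /* hypothetical CPU count */
#define AFFINITY_MIN_SWITCHES 5   /* the "currently 5" threshold (assumed >=) */

struct cpu_sketch {
    bool idle;        /* nothing is running on this CPU */
    unsigned queued;  /* jobs waiting in its run queue */
};

struct job_sketch {
    int last_cpu;        /* CPU the job last ran on, or -1 if none */
    unsigned nswitches;  /* context switches so far */
};

static struct cpu_sketch cpus[NCPU];
static int scan_start;  /* rotates so scans do not always begin at CPU 0 */

/* Stand-in for idle_pick: return an idle CPU index, or -1 if none. */
static int
idle_pick_sketch(void)
{
    for (int i = 0; i < NCPU; i++)
        if (cpus[i].idle)
            return i;
    return -1;
}

static int
select_cpu_sketch(const struct job_sketch *j)
{
    assert(j != NULL && j->last_cpu < NCPU);

    /* Weak affinity is only considered after enough context switches. */
    if (j->nswitches >= AFFINITY_MIN_SWITCHES && j->last_cpu >= 0)
        return j->last_cpu;

    /* Prefer an idle CPU if one exists. */
    int c = idle_pick_sketch();
    if (c >= 0)
        return c;

    /* Otherwise scan circularly for the shortest queue. Ties go to the
     * CPU seen first, so the rotating start spreads equal-load picks. */
    int best = scan_start;
    for (int n = 1; n < NCPU; n++) {
        int i = (scan_start + n) % NCPU;
        if (cpus[i].queued < cpus[best].queued)
            best = i;
    }
    scan_start = (scan_start + 1) % NCPU;
    return best;
}
```

With every CPU busy and equally loaded, successive calls for a new job
return CPU 0, then CPU 1, and so on, which is the "don't unfairly overload
one CPU" behaviour described in the quoted text.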

That sounds like the new CPU selection algorithm runs most efficiently
on a single-socket multi-core machine. Can you elaborate on how it is
intended to scale on NUMA machines, please?

| ... and a couple of notes:
| - Some or all of the items in runqueue_t could be safely merged into
|   schedstate_percpu, but I think it's better to integrate things piecemeal
|   if possible.
| - Previously M2's per-CPU approach performed poorly on but with
|   yesterday's changes to rwlocks and turnstiles it matches the global run
|   queue used by 4BSD. This shows the number of seconds to complete
|   -j16 release on an 8-core machine:
| Comments?
| Thanks,
| Andrew

