The old Linux scheduler, used in 2.4 and earlier kernels, had
scalability problems due to its single-runqueue design. This
limited throughput for workloads with a large number of tasks and
increased lock contention on systems with multiple CPUs.
Andrea Arcangeli added NUMA awareness
to the old Linux scheduler, but this approach still had to
cope with the single runqueue.
There were two approaches for improving the scalability of the old
Linux scheduler: one by Mike
Kravetz and Hubertus Franke and one by
Davide Libenzi. Both were multi-queue schedulers.
Hubertus Franke, Mike Kravetz, Shailabh Nagar and Rajan Ravindran made
their MQ
scheduler NUMA-aware by extending it to deal with CPU pools.
As the 2.5 kernel development started, Ingo Molnar wrote a multi-queue
scheduler for Linux, the O(1) scheduler. The O(1) scheduler,
which was integrated into the 2.5 Linux kernel, is a multi-queue
scheduler with a runqueue per processor. This makes it easy to
dispatch a process on the processor it last ran on, so that it benefits
from cache warmth. However, the O(1) scheduler has no awareness
of NUMA nodes and hence makes no effort to keep a process on the same
node.
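As a rough illustration of the per-processor runqueue idea described above, the following is a minimal user-space sketch, not the kernel's implementation. All names here (NR_CPUS, struct task, struct runqueue, enqueue_task, pick_next_task) are hypothetical and chosen only for this example; the real O(1) scheduler also uses priority arrays and per-runqueue locks, which are left out.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4              /* hypothetical CPU count for this sketch */

struct task {
    int pid;
    int last_cpu;              /* CPU the task last ran on */
    struct task *next;
};

struct runqueue {
    struct task *head;         /* simple FIFO of runnable tasks */
    struct task *tail;
    int nr_running;
};

/* One runqueue per processor instead of one global queue. */
static struct runqueue runqueues[NR_CPUS];

/* Queue a task on the runqueue of the CPU it last ran on,
 * so that it is likely to find its data still in that CPU's cache. */
static void enqueue_task(struct task *t)
{
    assert(t->last_cpu >= 0 && t->last_cpu < NR_CPUS);
    struct runqueue *rq = &runqueues[t->last_cpu];

    t->next = NULL;
    if (rq->tail)
        rq->tail->next = t;
    else
        rq->head = t;
    rq->tail = t;
    rq->nr_running++;
}

/* Each CPU picks its next task from its own queue only.
 * Note that nothing here knows which NUMA node a CPU belongs to. */
static struct task *pick_next_task(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    struct runqueue *rq = &runqueues[cpu];
    struct task *t = rq->head;

    if (!t)
        return NULL;
    rq->head = t->next;
    if (!rq->head)
        rq->tail = NULL;
    rq->nr_running--;
    t->next = NULL;
    return t;
}
```

Because each CPU touches only its own queue in the common case, there is no single lock for all CPUs to contend on. A real scheduler still needs a load balancer to move tasks between queues, and that is where node awareness comes in.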
Previous NUMA-aware schedulers had been based on the MQ
scheduler. Work on these projects stopped with the acceptance of
the O(1) scheduler.
O(1) scheduler extensions have been developed that change the
load-balancing logic within the scheduler to favor keeping a process on
the same node. In addition, exec() has been enhanced to place the
process on the least loaded node at exec time. These
two items provide a basis for a more extensive NUMA-aware scheduler.
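The two extensions can be sketched roughly as follows. This is an illustrative user-space model only, not the code from either patch set: the CPU-to-node mapping, the imbalance thresholds LOCAL_IMBALANCE and REMOTE_IMBALANCE, and the function names node_of_cpu, node_load, least_loaded_node and find_busiest_cpu are all assumptions made for the example.

```c
#include <assert.h>

#define NR_CPUS          8     /* hypothetical machine: 2 nodes x 4 CPUs */
#define CPUS_PER_NODE    4
#define NR_NODES         (NR_CPUS / CPUS_PER_NODE)
#define LOCAL_IMBALANCE  2     /* assumed threshold for pulling within a node */
#define REMOTE_IMBALANCE 4     /* assumed (higher) threshold across nodes */

static int nr_running[NR_CPUS];    /* runnable tasks per CPU runqueue */

static int node_of_cpu(int cpu)
{
    return cpu / CPUS_PER_NODE;
}

/* Load of a node: sum of the runqueue lengths of its CPUs. */
static int node_load(int node)
{
    int load = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (node_of_cpu(cpu) == node)
            load += nr_running[cpu];
    return load;
}

/* exec()-time placement: choose the least loaded node
 * (ties go to the lowest node number). */
static int least_loaded_node(void)
{
    int best = 0;
    for (int node = 1; node < NR_NODES; node++)
        if (node_load(node) < node_load(best))
            best = node;
    return best;
}

/* Load balancing: find a CPU to pull a task from. A busy CPU on the
 * same node is preferred; a remote CPU is chosen only if no local CPU
 * qualifies and the remote imbalance is large. Returns -1 if balanced. */
static int find_busiest_cpu(int this_cpu)
{
    int this_node = node_of_cpu(this_cpu);
    int this_load = nr_running[this_cpu];
    int best_local = -1, best_remote = -1;

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (cpu == this_cpu)
            continue;
        int diff = nr_running[cpu] - this_load;
        if (node_of_cpu(cpu) == this_node) {
            if (diff >= LOCAL_IMBALANCE &&
                (best_local < 0 || nr_running[cpu] > nr_running[best_local]))
                best_local = cpu;
        } else {
            if (diff >= REMOTE_IMBALANCE &&
                (best_remote < 0 || nr_running[cpu] > nr_running[best_remote]))
                best_remote = cpu;
        }
    }
    return best_local >= 0 ? best_local : best_remote;
}
```

Placing a process at exec() is attractive because the process has little cache or memory state worth preserving at that point, so moving it to another node is cheap. The higher remote threshold is one simple way to express "favor the same node"; the actual patches may weigh the choice differently.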
Erich Focht has developed a full-featured, node-affine NUMA scheduler.
This was originally developed on 2.4 (on top of a backport of the 2.5
O(1) scheduler) for IA64-based NUMA machines. Matt Dobson ported
it to the x86-based NUMAQ hardware. Michael Hohnbaum ported
it forward to 2.5 and then worked with Erich on further refinements.
During the final 2.5 pre-feature-freeze crunch, Erich broke his
scheduler into five pieces, the first two of which provide node-aware
load balancing in the scheduler and balancing at exec(). Michael
developed similar extensions. After extensive testing, the best
performance was found using Erich's first part and Michael's second
part. These have now been combined and exist in a bk tree at
bk://numa-ef.bkbits.net/numa-sched.