The old Linux scheduler, used in 2.4 and earlier kernels, had
scalability problems due to its single runqueue design. This
limited throughput for workloads with a large number of tasks and
increased lock contention on systems with multiple CPUs.
Andrea Arcangeli added NUMA-awareness to the old Linux scheduler, but this approach still had to cope with the single runqueue.
There were two approaches for improving the scalability of the old Linux scheduler: one by Mike Kravetz and Hubertus Franke and one by Davide Libenzi. Both were multi-queue schedulers. Hubertus Franke, Mike Kravetz, Shailabh Nagar and Rajan Ravindran made their MQ scheduler NUMA-aware by extending it to deal with CPU pools.
As 2.5 kernel development started, Ingo Molnar wrote a multi-queue scheduler for Linux, the O(1) scheduler. The O(1) scheduler, which was integrated into the 2.5 Linux kernel, is a multi-queue scheduler with one runqueue per processor. This makes it easy to dispatch a process on the processor it last ran on, and thus to benefit from cache warmth. However, the O(1) scheduler has no awareness of NUMA nodes and hence makes no effort to keep a process on the same node.
Previous NUMA-aware schedulers had been based on the MQ scheduler. Work on these projects stopped with the acceptance of the O(1) scheduler.
O(1) scheduler extensions have been developed that change the load balance logic within the scheduler to favor keeping a process on the same node. In addition, exec() has been enhanced to pick the least loaded node on which to place the process. These two items provide a basis for a more extensive NUMA-aware scheduler.
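The exec() placement idea can be illustrated with a small user-space sketch. This is not the kernel code from any of the patches discussed here; the node_load structure, its fields, and the least_loaded_node() function are hypothetical names chosen for illustration, and "load" is assumed to mean runnable tasks per CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-node load record; not a kernel data structure. */
struct node_load {
    int nr_cpus;     /* CPUs belonging to this node */
    int nr_running;  /* runnable tasks summed over the node's runqueues */
};

/*
 * Return the index of the node with the fewest runnable tasks per CPU.
 * Ties go to the lowest-numbered node.
 */
static int least_loaded_node(const struct node_load *nodes, size_t nr_nodes)
{
    size_t best = 0;

    assert(nodes != NULL && nr_nodes > 0);

    for (size_t i = 1; i < nr_nodes; i++) {
        /*
         * Compare nr_running / nr_cpus for node i against the current
         * best by cross-multiplying, which avoids integer division.
         */
        long lhs = (long)nodes[i].nr_running * nodes[best].nr_cpus;
        long rhs = (long)nodes[best].nr_running * nodes[i].nr_cpus;

        if (lhs < rhs)
            best = i;
    }
    return (int)best;
}
```

In this sketch a task calling exec() would be moved to a CPU on the returned node. exec() is a natural point for such a move because the old address space is discarded at that point, so the task has little cache or memory state tying it to its current node.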
Erich Focht has developed a full-featured, node-affine NUMA scheduler. This was originally developed on 2.4 (on top of a backport of the 2.5 O(1) scheduler) for IA64-based NUMA machines. Matt Dobson ported this to the x86-based NUMAQ hardware. Michael Hohnbaum ported this forward to 2.5 and then worked with Erich on further refinements.
During the 2.5 final pre-feature-freeze crunch, Erich broke his
scheduler into 5 pieces, the first two of which provide the node-aware
load balance in the scheduler and the balance at exec(). Michael
developed similar extensions. After extensive testing, optimal
performance was found using Erich's first part and Michael's second
part. These have now been combined and exist in a bk tree at