NUMA Aware Scheduler Extensions

On NUMA systems, performance is best when a process runs as close as possible to the memory it accesses.  For most processes, this means allocating all of the process's memory from a single node and dispatching the process on processors of that node.  NUMA awareness within the scheduler is necessary to preserve this locality of processes to memory, primarily by dispatching a process on the same node throughout the process's life.

The old Linux scheduler, used in 2.4 and earlier kernels, had scalability problems due to its single-runqueue design.  This limited throughput for workloads with a large number of tasks and increased lock contention on systems with multiple CPUs.

Andrea Arcangeli added NUMA awareness to the old Linux scheduler, but this approach still had to cope with the single runqueue.

There were two approaches for improving the scalability of the old Linux scheduler: one by Mike Kravetz and Hubertus Franke and one by Davide Libenzi.  Both were multi-queue schedulers.  Hubertus Franke, Mike Kravetz, Shailabh Nagar and Rajan Ravindran made their MQ scheduler NUMA-aware by extending it to deal with CPU pools.

As 2.5 kernel development started, Ingo Molnar wrote a multi-queue scheduler for Linux, the O(1) scheduler.  The O(1) scheduler, which was integrated into the 2.5 Linux kernel, is a multi-queue scheduler with one runqueue per processor.  This facilitates dispatching a process on the same processor each time, and so benefits from cache warmth.  However, the O(1) scheduler has no awareness of NUMA nodes and hence makes no effort to keep a process on the same node.

Previous NUMA-aware schedulers had been based on the MQ scheduler.  Work on these projects stopped with the acceptance of the O(1) scheduler.

O(1) scheduler extensions have been developed that change the load-balancing logic within the scheduler to favor keeping a process on the same node.  In addition, exec() has been enhanced to pick the least loaded node on which to place a process during the exec.  These two items provide a basis for a more extensive NUMA-aware scheduler.
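
The two ideas above can be illustrated with a small user-space sketch in C.  This is not the kernel code and not the algorithm from either patch set; the function names, the per-node and per-CPU load arrays, and the remote_margin parameter are all hypothetical, chosen only to show what "pick the least loaded node" and "favor the local node when balancing" might look like under simple assumptions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative only: all names and data layouts here are assumptions,
 * not taken from the Linux scheduler or from the patches described above.
 */

/* exec()-time placement: return the index of the node with the fewest
 * runnable tasks.  Ties go to the lowest-numbered node. */
static int least_loaded_node(const int *node_load, size_t nr_nodes)
{
    assert(node_load != NULL && nr_nodes > 0);

    size_t best = 0;
    for (size_t n = 1; n < nr_nodes; n++) {
        if (node_load[n] < node_load[best])
            best = n;
    }
    return (int)best;
}

/* Node-favoring balance: pick the CPU that this_cpu should pull work from,
 * or return -1 if no CPU is busy enough.  A CPU on a remote node has
 * remote_margin subtracted from its load before comparison, so a remote
 * CPU is chosen only when it is clearly busier than any local one. */
static int pick_busiest_cpu(const int *cpu_load, const int *cpu_node,
                            size_t nr_cpus, size_t this_cpu,
                            int remote_margin)
{
    assert(cpu_load != NULL && cpu_node != NULL && this_cpu < nr_cpus);

    int best_cpu = -1;
    int best_score = cpu_load[this_cpu];   /* must be strictly busier than us */

    for (size_t c = 0; c < nr_cpus; c++) {
        if (c == this_cpu)
            continue;

        int score = cpu_load[c];
        if (cpu_node[c] != cpu_node[this_cpu])
            score -= remote_margin;        /* penalize cross-node migration */

        if (score > best_score) {
            best_score = score;
            best_cpu = (int)c;
        }
    }
    return best_cpu;
}
```

In this sketch a larger remote_margin makes cross-node migration rarer, at the cost of tolerating more imbalance between nodes.  The right trade-off depends on the hardware and workload, and is not something this example attempts to settle.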

Erich Focht has developed a full-featured, node-affine NUMA scheduler.  This was originally developed on 2.4 (on top of a backport of the 2.5 O(1) scheduler) for IA64-based NUMA machines.  Matt Dobson ported this to the x86-based NUMAQ hardware.  Michael Hohnbaum ported this forward to 2.5 and then worked with Erich on further refinements.

During the final 2.5 pre-feature-freeze crunch, Erich broke his scheduler into five pieces, the first two of which provide the node-aware load balancing in the scheduler and the balancing at exec().  Michael developed similar extensions.  After extensive testing, optimal performance was found using Erich's first part and Michael's second part.  These have now been combined and exist in a bk tree at bk://

Work continues on the NUMA-aware scheduler using this as a base.