NUMA Status: Item Definition

Version 0.4

Default Actions

CONFIG_DISCONTIGMEM

Many NUMA architectures use a modular architecture that allows "nodes" containing CPUs, memory, and/or I/O devices to be added in a building-block fashion. Many of these architectures partition physical address space in order to simplify hardware design. Such partitioning can leave large "holes" in the physical address space when nodes are not fully populated with memory. These "holes" would cause much memory to be wasted if the physical address space were to be represented as a linear array. The CONFIG_DISCONTIGMEM code allows a discontiguous physical address space to be represented without such wastage.
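
To see the scale of the wastage, consider a back-of-the-envelope model (the node layout and the 64-byte page-structure size below are invented for illustration; the real size of struct page varies by kernel version and architecture). With two 1 GB nodes, the second starting at physical address 16 GB, a linear array must describe the 15 GB hole as well:

    #include <stdio.h>

    int main(void)
    {
        long long page_size   = 4096;
        long long struct_page = 64;            /* illustrative size */
        long long span        = 17LL << 30;    /* linear map spans 0..17 GB */
        long long present     = 2LL << 30;     /* only 2 GB is real memory */

        printf("linear mem_map:   %lld MB of page structures\n",
               span / page_size * struct_page >> 20);
        printf("per-node mem_map: %lld MB of page structures\n",
               present / page_size * struct_page >> 20);
        return 0;
    }

Here a single linear mem_map would spend 272 MB on page structures, versus 32 MB for per-node maps covering only the memory actually present.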

Architecture specific CONFIG_DISCONTIGMEM support

Each architecture needs to be able to describe its physical layout to the Linux kernel. This includes specifying which address ranges belong to which node, and whether there are holes in between those ranges (a hole is a physical address range for which there is no real memory). CONFIG_DISCONTIGMEM currently represents a solution to some of these problems, but the memory it describes is not necessarily discontiguous (it might be more appropriately named CONFIG_COMPLEXMEM).

This item includes the problem of configuring alloc_bootmem() and friends for memory allocation early in the bootstrap process. There is some support for NUMA in 2.4 via the alloc_bootmem_node() interface.

Generic CONFIG_DISCONTIGMEM support

The generic infrastructure work for this is complete and present in the current 2.4 kernel. However, CONFIG_DISCONTIGMEM is off by default except on specific architectures (see the "arch support" section of CONFIG_DISCONTIGMEM in the table).

Memory Classes: User Memory

It is important that memory be allocated from specific nodes, so that software running on the machine may take advantage of NUMA locality. To achieve this, the Linux kernel's memory is partitioned by node. This partitioning provides the underpinnings for a number of subsystems requiring NUMA-aware memory allocation.

This functionality is in the 2.4 kernel.

Memory Classes: kmalloc() and kmem_cache_alloc()

In many workloads, most of the CPU time is consumed by user applications. This is appropriate: after all, it is the application that is doing the useful work. Nonetheless, the kernel often does consume a not-insignificant fraction of the CPU time. In these cases, it can be important that some kernel data structures be allocated on the node that they correspond to. Therefore, it is important that the kernel have some way of specifying which node it wants memory allocated from. Extensions to kmalloc() and kmem_cache_alloc() are needed to allow this targeted memory allocation. Since kmalloc() is built on top of kmem_cache_alloc(), only minor changes to kmalloc() should be required once the kmem_cache_alloc() modifications are in place.
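
As a rough sketch of the interface shape (the _node naming, the node count, and the bookkeeping below are invented for illustration, not taken from any actual patch), the idea is simply to add a node argument to the familiar allocation calls:

    #include <stdio.h>
    #include <stdlib.h>

    #define NR_NODES 4                    /* illustrative node count */

    static size_t bytes_used[NR_NODES];   /* toy per-node bookkeeping */

    /* Hypothetical analogue of a node-aware kmalloc(): the same
     * contract as kmalloc(), plus the node the memory should come
     * from.  A real implementation would draw from per-node slab
     * caches; this model just tracks usage and uses malloc(). */
    void *kmalloc_node_model(size_t size, int node)
    {
        if (node < 0 || node >= NR_NODES)
            return NULL;
        bytes_used[node] += size;
        return malloc(size);
    }

    int main(void)
    {
        /* Allocate one per-node control structure per node, as a
         * NUMA-aware kernel subsystem might. */
        for (int node = 0; node < NR_NODES; node++) {
            void *p = kmalloc_node_model(128, node);
            printf("node %d: %zu bytes used\n", node, bytes_used[node]);
            free(p);
        }
        return 0;
    }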

Default Memory Binding

It is important that user programs' memory is allocated on a node close to the one containing the CPU on which they are running. Therefore, by default, page faults are satisfied by memory from the node containing the page-faulting CPU. Because the first CPU to touch the page will be the CPU that faults the page in, this default policy is called "first touch".

This functionality is in the 2.4 kernel.
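
The practical consequence for applications is that whichever task first initializes a page determines that page's home node. The sketch below (plain C and pthreads, no NUMA API required) shows the usual way to exploit this: each worker initializes its own slice of a shared buffer, so first touch places each slice on the worker's node, whereas having the main thread memset() the whole buffer would concentrate it all on one node. (In practice each worker would also be bound to a CPU or node; see Bind Tasks to CPU(s) below.)

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define NWORKERS   4
    #define SLICE_SIZE (1 << 20)          /* 1 MB per worker */

    static char buf[NWORKERS * SLICE_SIZE];

    /* Each worker touches (initializes) its own slice; under the
     * first-touch policy, each slice is therefore allocated on the
     * node where that worker is running. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        memset(buf + id * SLICE_SIZE, 0, SLICE_SIZE);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NWORKERS];
        for (long i = 0; i < NWORKERS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < NWORKERS; i++)
            pthread_join(tid[i], NULL);
        printf("buffer initialized slice-by-slice\n");
        return 0;
    }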

Some applications can gain better performance by explicitly specifying which node's memory should be used for a given range of the virtual address space; however, this does not qualify as a "default action". Please see Bind Memory to Node(s) for more details on this approach.

Replication: Kernel Text

Since kernel text is read-only on production systems, there is little downside to replicating it, placing a copy on each node. This does consume extra memory, but kernel text is relatively small and memories of NUMA machines relatively large. Kernel-text replication requires special handling from debuggers when setting breakpoints and from /dev/mem in cases where users are sufficiently insane to modify kernel text on the fly. In addition, kernel-text replication means that there is no longer a single "well-known" offset between a kernel-text virtual address and the corresponding physical address.

This functionality is present in some architectures (e.g., sgi-ip27) in the 2.4 kernel. Also, the ia64 discontigmem patch provides kernel text replication support for ia64.

Replication: Kernel Modules

Once loaded, kernel module text is also read-only on production systems, so, again, there is little performance downside to replicating it, placing a copy on each node. The tradeoffs are similar to those for replicating kernel text, with the added complexity resulting from the need to unload kernel modules--some boot-time mapping tricks that work for kernel text do not work so easily for kernel-module text.

On the other hand, there is no "well-known offset" between kernel-module text virtual and physical addresses, so the only thing lost by replication is the one-to-one correspondence.

Replication: User Text

User text is also read-only for production applications (applications under development may modify text to accommodate breakpoints). The tradeoffs are similar to those for kernel-module text, with the following additional issues: (1) the aggregate size of user text can be quite large, (2) it may not make sense to replicate every program's text across every node, especially for single-threaded applications that run on only one CPU (and thus only one node) at a time, and (3) programs exit much more frequently than modules are unloaded, so efficiently dereplicating user text is much more important than efficiently cleaning up after a module unload. It is not clear that immediate dereplication upon exit() is desirable.

Replication: User Libraries

User libraries are yet another example of a mapping that is normally read-only. The tradeoffs are similar to those for user text; however, libraries such as libc will be heavily shared. Replication of these libraries would be expected to give greater performance benefits than would replication of random user programs.

Replication: Read-Only Mappings

There may be benefits to unifying the user-text and user-library cases by simply replicating all read-only user mappings. One good candidate for replication of read-only mappings is NLS data used for internationalization/localization.

Replication: Read-Mostly Mappings

If a given mapping is writable but very rarely written, one could obtain performance benefits by replicating the mapping and mapping it read-only. When a (presumably rare) write occurred, the kernel would update all the replicas from the page-fault handler.

It seems unlikely that the complexity and measurement overhead are justified by the potential performance gains.

Per-Node kswapd

Since each node has its own pool of memory, there is motivation to handle memory pressure separately on the individual nodes. Part of this would be handled by having a per-node kswapd.

Per-Node reclaimd

Again, since each node has its own pool of memory, there is motivation to handle memory pressure separately on the individual nodes. The other part of this would be handled by having a per-node reclaimd.

NUMA-Aware Scheduler

Just as a normal SMP scheduler goes to some effort to run processes and threads on the CPU that they most recently ran on in order to take advantage of cache state, a NUMA-aware scheduler goes to some effort to run processes and threads on one of the nodes that they most recently ran on.

Different architectures are subject to different tradeoffs. For example, architectures with per-node caches (which cache memory fetched from other nodes) reward running a process or thread on the same node that it ran on previously, even if its memory is allocated on some other node: the per-node caches act in a manner similar to per-CPU caches. Architectures without per-node caches may offer little or no performance benefit from running on a different CPU of the previous node when that process's or thread's memory is allocated on some other node, depending on details of the CPUs' bus protocol.

NUMA-aware schedulers generally group CPUs into nodes. One approach is to have a goodness penalty for cross-node scheduling, and another approach is to restrict the search for a CPU to the node containing the CPU that the process or thread last ran on.
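
As a toy model of the goodness-penalty approach (the cpu_to_node() mapping, the base weight, and the penalty value below are invented for illustration; this is not code from either patch):

    #include <stdio.h>

    #define CPUS_PER_NODE 4
    #define NODE_PENALTY  10   /* illustrative weight; real patches tune this */

    static int cpu_to_node(int cpu) { return cpu / CPUS_PER_NODE; }

    /* Reduce a candidate CPU's goodness when it sits on a different
     * node than the CPU the task last ran on: cross-node migration
     * is discouraged, but not forbidden. */
    static int numa_goodness(int base_weight, int candidate_cpu, int last_cpu)
    {
        int w = base_weight;
        if (cpu_to_node(candidate_cpu) != cpu_to_node(last_cpu))
            w -= NODE_PENALTY;
        return w;
    }

    int main(void)
    {
        printf("same node:  %d\n", numa_goodness(20, 1, 2));  /* 20 */
        printf("cross node: %d\n", numa_goodness(20, 5, 2));  /* 10 */
        return 0;
    }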

Andrea Arcangeli and Hubertus Franke have produced patches that bring some degree of NUMA-awareness to the scheduler.

Kernel-Subsystem NUMA-ization

There may be some kernel subsystems and data structures that need to be partitioned over nodes. Such subsystems and data structures, if any, will be identified via benchmarking.

Multipath I/O

Multipath I/O (MPIO) allows redundant connections between the computer and its storage. These redundant connections allow greater I/O bandwidth, greater resilience against failure, and, specifically for NUMA, DMA of any memory to any disk without the data traversing the global interconnect. (Note, however, that not all NUMA Unix OSes implement MPIO. MPIO appears to be most important on architectures with remote caches that run applications with unpredictable data-reference patterns.)

FibreChannel is often used to implement MPIO, in which case each node has at least one connection to a FibreChannel switch, as shown in the diagram. The large arrow to the left represents the system interconnect.

There are a couple of patches available that are related to multipath I/O. Code for some T3-specific facilities is available only for 2.2.x kernels. These patches add something called "SCSI ALIASES" to the SCSI mid-layer: a user-level program tells the SCSI layer (via ioctls on the sg driver) that two or more scsi_device structures point to the same device. Only one path is active at a time; an alias path is selected in the fail-over case. The patch retries failed I/Os, and appears to work on block devices (disks) only. There is no load balancing, since multiple paths are never active at the same time. Path selection is not based on topology or error status: if the currently active path fails, the next path in the list is simply selected.
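
In outline, that selection logic amounts to the following toy model (the structure and function names are invented for illustration; the actual patch works in terms of aliased scsi_device entries):

    #include <stdio.h>

    #define NPATHS 3

    struct mpio_device {
        int active;               /* index of the active path */
        int path_ok[NPATHS];      /* nonzero if the path is usable */
    };

    /* Fail-over selection as described above: no load balancing and
     * no topology or error-history awareness; on failure, simply
     * advance to the next usable path in the list. */
    int select_next_path(struct mpio_device *dev)
    {
        for (int i = 1; i <= NPATHS; i++) {
            int candidate = (dev->active + i) % NPATHS;
            if (dev->path_ok[candidate]) {
                dev->active = candidate;
                return candidate;
            }
        }
        return -1;                /* all paths down */
    }

    int main(void)
    {
        struct mpio_device dev = { 0, { 0, 1, 1 } };  /* path 0 failed */
        printf("failing over to path %d\n", select_next_path(&dev));
        return 0;
    }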

Another facility, named "md", is available in 2.4. It is actually a software-RAID device driver that also handles multipathing. Currently, it does no load balancing. Because it is implemented above the sd driver, it does not handle error-status-based or topology-based path selection.

Page Migration

Some NUMA systems have hardware features that allow software to track how often given pages (or other ranges of memory) are referenced by each node. This information may be used to determine when and where to migrate pages.

SGI has used this approach, but has not been happy with the results. The usefulness of this approach likely depends on the hardware, but thus far, there have been no positive experience reports with it. (But if you know of any, please do not keep them secret!!!)

It is possible to implement page migration on hardware without special page counters by using page-level protections. However, this approach seems much less likely to be beneficial than approaches using special hardware.
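
A user-space sketch of the page-protection approach follows (POSIX mmap/sigaction; a kernel implementation would instead manipulate page-table entries). The region starts out inaccessible, each first touch faults, and the handler counts the reference before unprotecting the page; re-protecting pages periodically would yield an ongoing reference rate for a migration policy to act on.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NPAGES 4

    static char *buf;
    static long  pagesz;
    static int   touches[NPAGES];   /* software "reference counter" */

    static void fault_handler(int sig, siginfo_t *si, void *uc)
    {
        long page = ((char *)si->si_addr - buf) / pagesz;
        (void)sig; (void)uc;
        if (page < 0 || page >= NPAGES)
            _exit(1);                            /* unrelated fault */
        touches[page]++;                         /* count the reference */
        mprotect(buf + page * pagesz, pagesz,    /* then allow access */
                 PROT_READ | PROT_WRITE);
    }

    int main(void)
    {
        struct sigaction sa = { 0 };
        sa.sa_sigaction = fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        pagesz = sysconf(_SC_PAGESIZE);
        buf = mmap(NULL, NPAGES * pagesz, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        buf[0] = 1;                /* touch pages 0 and 2 only */
        buf[2 * pagesz] = 1;
        for (int i = 0; i < NPAGES; i++)
            printf("page %d: %d reference(s)\n", i, touches[i]);
        return 0;
    }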

Process Migration

NUMA systems that do not have the hardware features required for page migration can instead migrate on a process-by-process rather than a page-by-page basis. Such systems monitor long-term memory and CPU utilization on each node, and choose processes to migrate in order to maintain long-term balance of this load. These systems typically handle shorter-term CPU load imbalances by executing processes on remote nodes, so that a process's CPU and memory might reside on different nodes.

This approach has been found useful by Sequent (now IBM). Mileage may vary depending on hardware architecture: Sequent's systems had large remote caches and relatively high memory-latency ratios. Systems without remote caches and with low memory-latency ratios may or may not see significant benefit--some prototyping and benchmarking is in order here.

I/O-Based Migration

NUMA systems might choose to move processes in order to put them close to devices that they are doing I/O to. I have not heard of any uses of this, however, the NUMA systems with which I am most familiar all have multipath I/O, which allows equal access to all storage I/O from all nodes. I/O-based migration could potentially be useful on systems that have node-local I/O devices (e.g., that do not support FibreChannel storage-area networks).

If you know of cases where I/O-based migration was helpful, or if you have implemented and/or benchmarked it under Linux, please do not keep it a secret!

NUMA-Aware Locking

NUMA systems can be subject to lock starvation under high lock contention. Lock starvation occurs when the contention on a given lock is so high that by the time a CPU releases the lock, at least one other CPU on that same node is requesting it again. NUMA latencies mean that these local CPUs learn that the lock is available earlier than remote CPUs do. On some architectures, the CPUs on this node can monopolize the lock, completely starving CPUs on other nodes.

The "right" way to solve this problem is to reduce lock contention. However, in some cases, high lock contention may be infrequent (for example, occurring only in presence of rare hardware-fault conditions) and reducing the contention may require considerable complexity. In such cases, it may be better to have a NUMA-aware lock that prevents starvation (and increases high-contention performance) than to address the infrequent contention problem.

Sequent and IBM have had good experiences with NUMA-aware locking in both DYNIX/ptx and AIX.
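
Real NUMA-aware locks typically add a node-local level that hands the lock to same-node waiters a bounded number of times before crossing the interconnect. The sketch below (C11 atomics) shows only the fairness half of that idea: a ticket lock grants the lock in strict arrival order, so one node's CPUs cannot monopolize it.

    #include <stdatomic.h>
    #include <stdio.h>

    struct ticket_lock {
        atomic_uint next;    /* next ticket to hand out */
        atomic_uint owner;   /* ticket currently being served */
    };

    static void ticket_lock_acquire(struct ticket_lock *l)
    {
        unsigned int me = atomic_fetch_add(&l->next, 1);
        while (atomic_load(&l->owner) != me)
            ;                /* strict FIFO: no CPU can be starved */
    }

    static void ticket_lock_release(struct ticket_lock *l)
    {
        atomic_fetch_add(&l->owner, 1);
    }

    int main(void)
    {
        struct ticket_lock l = { 0, 0 };
        ticket_lock_acquire(&l);
        printf("lock held; releasing\n");
        ticket_lock_release(&l);
        return 0;
    }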

NUMA-Aware Tools

This item includes things like an option to the "top" command that causes it to focus on the activity on a specified node. Commands that might have such options include "ps", "pstree", and kernel debuggers.

NUMA-Aware Development Tools

Scientific applications that require high memory bandwidth may benefit from development tools that identify data structures that require special placement. Other ideas here? Is this worthwhile?

Performance-Monitoring Tools

Tools that monitor system performance (e.g., "sar", "vmstat", "netstat") could be enhanced to focus on a specified node or to list out per-node statistics.

Simple Binding

Topology Discovery

In order for an application to make effective use of simple binding requests, it must be able to determine how the system is put together. In particular, it must be able to determine the following:
  1. Which node a given CPU is located in.
  2. What memory is located in a given node.
  3. What I/O adapters are located in a given node.
  4. Some idea of the "distance" between a specified pair of nodes.
One way to do this is to create a directory hierarchy in /proc that reflects the topology. Kanoj has produced a patch that takes this approach.

It is also useful to have a programmatic interface to map between CPUs and nodes.
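
For example, such a query might be built on the /proc hierarchy as sketched below; the path and per-node file format used here ("/proc/node<N>/cpumap", one CPU number per line) are invented for illustration and do not describe the layout of Kanoj's actual patch.

    #include <stdio.h>

    #define MAX_NODES 8

    /* Return the node containing `cpu`, or -1 if the topology is
     * not exported.  The /proc layout scanned here is hypothetical. */
    int cpu_to_node(int cpu)
    {
        char path[64];
        for (int node = 0; node < MAX_NODES; node++) {
            snprintf(path, sizeof(path), "/proc/node%d/cpumap", node);
            FILE *f = fopen(path, "r");
            if (!f)
                continue;
            int c;
            while (fscanf(f, "%d", &c) == 1) {
                if (c == cpu) {
                    fclose(f);
                    return node;
                }
            }
            fclose(f);
        }
        return -1;
    }

    int main(void)
    {
        printf("cpu 3 is on node %d\n", cpu_to_node(3));
        return 0;
    }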

It may also be useful to enhance procfs and/or devfs to represent (either user-specified or otherwise) NUMA bindings/associations of processes, and to represent I/O topology. However, this will require some thought as well as a definite proposal.

Topology Discovery: Dynamic Reconfiguration

Systems that allow dynamic reconfiguration (e.g., hot-plugging of memory and nodes) provide motivation for informing NUMA-aware applications of changes in the topology, in order to allow these applications to reshape themselves to the new topology.

A first cut at NUMA APIs should probably ignore dynamic reconfiguration. Support for it might take the form of a signal or callback informing the application of the change.

Bind Tasks to CPU(s)

Some benchmarks and applications can gain higher throughput on both SMP and NUMA systems by allowing tasks to be bound to specific CPUs. This binding allows applications to keep specified tasks close to a given node's memory on NUMA systems and can greatly reduce cache-bouncing on SMP systems.

The "runon()" API comes close to what is needed, but lacks a restriction argument. We therefore suggest a bindtocpu() interface, which binds to a set of CPUs at an arbitrary point in a process's or thread's execution. To bind at the next fork() or exec(), launch policies (similar to those pioneered by HP/UX) should be used so that processes can be moved from one node to another (if required) at the point where their state is smallest.

Bind Tasks to Node(s)

Binding tasks to specific CPUs can in some cases unnecessarily leave CPUs idle. In these cases, it may be better to bind tasks to specific nodes, allowing any CPU on that node to run the task. Binding a task to a specified node helps ensure that the task's memory will be allocated on that node, reducing memory latency. In addition, any cache bouncing that does occur due to tasks being moved among CPUs is less harmful when all memory and cachelines are confined to a single node.

Note that it is cheapest to perform this binding at exec() time, since at that time, the process has minimal state that needs to be moved. However, it is sometimes necessary to perform the binding at fork() time or at an arbitrary point in the process's execution. This latter capability is especially useful when running online diagnostics and software that measures hardware bandwidths and latencies.

To bind to a node at an arbitrary point in a given task's execution, use bindtonode().

To bind at the next fork() or exec(), launch policies (similar to those pioneered by HP-UX) should be used so that processes can be moved from one node to another at the point where their state is smallest.

Bind Memory to Node(s)

By default, the 2.4 kernel binds a process's memory to the node containing the CPU that the process is running on at the time of the page fault (a "first touch" allocation policy). However, applications that span multiple nodes may need to control what memory is allocated on which node. Existing NUMA-aware Unix systems have implemented policies that control where page faults allocate memory from, including "first touch", "stripe", "round robin", and "most free". "First touch" allocates the memory on the node containing the page-faulting CPU. "Stripe" allocates memory from a specified subset of the nodes based on virtual address. "Round robin" allocates memory in sequence from a specified subset of the nodes.

Of these, "stripe" and "round robin" have been most useful for commercial database applications. The problem with "first touch" is that some packages have a single task initialize all of memory, which results in it being all allocated on a single node. "Most free" has not seen much use.

This information should be associated with a given range of the virtual-address space. In the first-cut implementation, it will be acceptable to restrict each mapping in the virtual-address space to a single policy. The user can always do multiple mappings, if a finer policy granularity is desired.

To bind a specific range of memory to a node (or set of nodes) at an arbitrary point in a given task's execution, use bindmemory().
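
A sketch of what bindmemory() might look like follows; the prototype and policy names are illustrative guesses at the interface described above, with a stub body so the example compiles.

    #include <stddef.h>
    #include <stdio.h>

    enum mem_policy { FIRST_TOUCH, STRIPE, ROUND_ROBIN, MOST_FREE };

    /* Apply `policy` over nodes[0..nnodes-1] to the virtual-address
     * range [addr, addr + len).  One policy per mapping, per the
     * first-cut restriction above.  Stub implementation. */
    int bindmemory(void *addr, size_t len, enum mem_policy policy,
                   const int *nodes, int nnodes)
    {
        printf("range %p+%zu: policy %d over %d node(s), first node %d\n",
               addr, len, (int)policy, nnodes, nodes[0]);
        return 0;
    }

    int main(void)
    {
        static char region[1 << 20];      /* 1 MB mapping to place */
        int nodes[] = { 0, 1, 2, 3 };
        /* Round-robin the region's pages over four nodes, as a
         * database-style application might. */
        return bindmemory(region, sizeof(region), ROUND_ROBIN, nodes, 4);
    }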

Bind Tasks to I/O Devices

In some cases, it may make sense to bind a task to an I/O device. However, in all the architectures that I am aware of, all I/O devices are associated with a specific node or set of nodes, so that tasks are bound to nodes rather than to I/O devices. There may be more motivation for binding tasks to I/O devices in architectures that allow nodes to be hot-swapped and hot-inserted, or in architectures that allow subsets of nodes to be connected to a given storage-area network.

This should not be considered for first-pass NUMA support.

Bind Memory to I/O Devices

When doing bulk I/O, it can be even more important for the memory to be on the same node as the device than it is for the task to be correctly placed. Again, however, I/O devices are located in nodes, so the memory can be bound to the node rather than to the device. And again, there may be more motivation for binding memory to I/O devices in architectures that allow nodes to be hot-swapped and hot-inserted, or in architectures that allow subsets of nodes to be connected to a given storage-area network.

This should not be considered for first-pass NUMA support.

Tools and Commands

It is important that the simple-binding capabilities be usable in scripts, so that non-NUMA-aware applications may take advantage of NUMA optimizations with few or no changes. Therefore, tools and commands are needed to do at least the following:
  1. Print out the machine's topology (this might just be "cat" on a file in /proc).
  2. Run a command (and any processes it creates) on a specified CPU or set of CPUs.
  3. Run a command (and any processes it creates) on a specified node.
  4. Determine what binding a specified process is subject to. This could be a new command and/or a new option to "ps" or "top".

Abstract Binding

The general idea behind abstract binding is to allow applications to specify binding hints or constraints independent of the underlying topology. The advantage of this is that multiple copies of the same program can run efficiently on a given machine without any special awareness of the details of the machine's configuration. Each program simply asks for the CPU, memory, and scheduling constraints that it needs, and the kernel and libraries take care of the rest.

Some instances of abstract binding have been tried and proven in proprietary Unix OSes, while others are more speculative.

These should not be considered for first-pass NUMA support, but should be energetically investigated for later implementation.

Specify Memory Locality Domains

Memory locality domains allow an application to specify how much resource it needs (e.g., CPUs and memory), and also a desired topology (e.g., ring or hypercube). The application creates a memory-locality domain, then specifies which processes are to be run within that memory-locality domain.

Memory-locality domains have seen heavy use within SGI's Irix OS. See the lse-tech message dated July 2nd by Paul Jackson for a similar approach adapted to Linux.
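
The call pattern might look as follows; every name and prototype here is invented for illustration (Irix's actual MLD interface and Paul Jackson's proposal both differ in detail).

    #include <stdio.h>

    typedef int mld_t;

    /* Request `ncpus` CPUs and `mbytes` MB of memory arranged in the
     * given topology ("ring", "cube", ...); returns a domain handle.
     * Stub implementation. */
    mld_t mld_create(int ncpus, int mbytes, const char *topology)
    {
        printf("created MLD: %d CPUs, %d MB, %s topology\n",
               ncpus, mbytes, topology);
        return 1;
    }

    /* Run task `pid` (0 == the caller) within the domain. */
    int mld_attach(mld_t mld, int pid)
    {
        printf("attached pid %d to MLD %d\n", pid, mld);
        return 0;
    }

    int main(void)
    {
        mld_t d = mld_create(8, 4096, "ring");
        return mld_attach(d, 0);
    }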

Link Tasks Together

Applications often do not care where a given task is run, as long as it is run "close" to some other task. This is especially true in applications that have groups of processes that communicate intensively with each other. If they are allowed to explicitly specify that tasks be linked, these applications can gain substantial performance improvements, again, without the application needing to be aware of the machine's configuration.

Task-to-task linking has seen heavy use within Sequent's (now IBM's) DYNIX/ptx OS, particularly via the attach_proc command and system call and the QUAD_ATTACH_TO_PARENT flag to qfork() and qexec().

Andi Kleen suggests use of a new CLONE_NODE flag to specify that the child task be bound to the same node as its parent, and that the child task should follow the parent if the former is moved.
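
In outline, usage might look like the sketch below. The flag value is invented (on a kernel without the patch, the bit is either ignored or already means something else), so this shows only the intended call shape, not working semantics.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    #define CLONE_NODE 0x80000000   /* hypothetical flag value */

    static int child(void *arg)
    {
        (void)arg;
        printf("child: would be pinned to the parent's node\n");
        return 0;
    }

    int main(void)
    {
        char *stack = malloc(64 * 1024);
        /* CLONE_NODE would bind the child to the parent's node and
         * keep it there if the parent is later moved. */
        int pid = clone(child, stack + 64 * 1024,
                        CLONE_NODE | SIGCHLD, NULL);
        if (pid > 0)
            waitpid(pid, NULL, 0);
        return 0;
    }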

Virtualize Resource Names

If multiple applications are to run simultaneously on a given NUMA system, it is convenient if the resource names can be virtualized. This is similar to virtual memory: just as a given virtual address maps to different physical memory in different tasks, the same virtual (say) CPU number would map to different physical CPUs in different applications. This allows all applications to be written as if they were using (say) CPUs starting with CPU 0.

However, it is important to note that some applications, such as diagnostics and hardware-performance-measurement tools, will need to be able to specify physical resources. Such applications typically understand the machine's configuration, and therefore mean it when they say "CPU 0".

Virtualizing the resource names requires that a virtual view of the machine topology be available, in order to allow the application to determine which of its virtual CPUs map to which of its virtual nodes.

Linking/Binding Hints

Linking and binding hints can be thought of as a generalization of a combination of the Specify Memory Locality Domains and the Link Tasks Together approaches. The idea is that the application simply specifies constraints on the placement of its component parts, and the kernel and libraries dynamically determine where to put all the components of all the applications so as to optimize performance.

This is currently thought to be a difficult problem to solve, and thus may be more in the area of research than of implementation. Nonetheless, I would love for this thought to be proven wrong!