NUMA Status: Item Definition
Version 0.4
Default Actions
Many NUMA architectures use a modular architecture that allows "nodes"
containing CPUs, memory, and/or I/O devices to be added in a
building-block fashion. Many of these architectures partition
physical address space in order to simplify hardware design. Such
partitioning can leave large "holes" in the physical address space
when nodes are not fully populated with memory. These "holes" would
cause much memory to be wasted if the physical address space were to
be represented as a linear array. The CONFIG_DISCONTIGMEM code allows
a discontiguous physical address space to be represented without such
wastage.
Each architecture needs to be able to describe its physical layout to
the Linux kernel. This includes specifying which address ranges
belong to which node, and whether there are holes in between those
ranges (a hole is a physical address range for which there is no real
memory). CONFIG_DISCONTIGMEM is currently used to represent a
solution to some of these problems, but is not necessarily
discontiguous (it might be more appropriately named
CONFIG_COMPLEXMEM).
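To make the space savings concrete, here is a small user-level sketch
(with an invented two-node layout) comparing the size of a single
linear mem_map against per-node maps. The structure and field names
are illustrative, not the kernel's:

    /*
     * Two half-populated nodes with a large hole between them.  A
     * linear page array must span the hole; per-node arrays do not.
     */
    #include <stdio.h>

    struct node_mem_range {              /* hypothetical descriptor */
            int           node;          /* NUMA node id            */
            unsigned long start_pfn;     /* first page frame        */
            unsigned long num_pages;     /* pages actually present  */
    };

    static struct node_mem_range layout[] = {
            { 0, 0x00000, 0x8000 },  /* node 0: 32K pages at pfn 0    */
            { 1, 0x40000, 0x8000 },  /* node 1: 32K pages at pfn 256K */
    };

    int main(void)
    {
            unsigned long i, linear = 0, per_node = 0;

            for (i = 0; i < sizeof(layout) / sizeof(layout[0]); i++) {
                    unsigned long end = layout[i].start_pfn +
                                        layout[i].num_pages;
                    if (end > linear)
                            linear = end;   /* linear map spans holes */
                    per_node += layout[i].num_pages;
            }
            printf("linear map entries:   %lu\n", linear);   /* 294912 */
            printf("per-node map entries: %lu\n", per_node); /*  65536 */
            return 0;
    }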
This item includes the problem of configuring alloc_bootmem() and
friends for memory allocation early in the bootstrap process. There is
some support for NUMA in 2.4 via the alloc_bootmem_node() interface.
The generic infrastructure work for this is complete, and present in
the current 2.4 kernel. But CONFIG_DISCONTIGMEM is off by default,
except in specific architectures (see the "arch support" section of
CONFIG_DISCONTIGMEM in the table).
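For illustration, here is a sketch of boot-time, node-local allocation
through the alloc_bootmem_node() interface mentioned above. The
per-node table and node count are invented; alloc_bootmem_node() and
NODE_DATA() are the actual 2.4 names:

    #include <linux/mm.h>
    #include <linux/init.h>
    #include <linux/bootmem.h>
    #include <linux/mmzone.h>

    #define NR_NODES 4 /* illustrative; the real count is arch-supplied */

    static void *per_node_table[NR_NODES];

    /* Called early in bootstrap, after the bootmem allocator is up. */
    void __init setup_per_node_tables(void)
    {
            int nid;

            for (nid = 0; nid < NR_NODES; nid++)
                    /* Each allocation is satisfied from node nid's pool. */
                    per_node_table[nid] =
                            alloc_bootmem_node(NODE_DATA(nid), PAGE_SIZE);
    }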
It is important that memory be allocated from specific nodes, so that
software running on the machine may take advantage of NUMA locality.
To achieve this, the Linux kernel's memory is partitioned by node.
This partitioning provides the underpinnings for a number of
subsystems requiring NUMA-aware memory allocation.
This functionality is in the 2.4 kernel.
In many workloads, most of the CPU time is consumed by user
applications. This is appropriate: after all, it is the application
that is doing the useful work. Nonetheless, the kernel often does
consume a not-insignificant fraction of the CPU time. In these cases,
it can be important that some kernel data structures be allocated on
the node that they correspond to. Therefore, it is important that the
kernel have some way of specifying which node it wants memory
allocated from. Extensions to kmalloc() and kmem_cache_alloc() are
needed to allow this targeted memory allocation. Since kmalloc() is
built on top of kmem_cache_alloc(), only minor changes to kmalloc()
should be required once the kmem_cache_alloc() modifications are in
place.
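One possible shape for these extensions is sketched below. The *_node
names are assumptions for discussion, not existing 2.4 interfaces:

    #include <linux/slab.h>

    /* Hypothetical node-targeted allocators: */
    void *kmem_cache_alloc_node(kmem_cache_t *cachep, int flags, int nodeid);
    void *kmalloc_node(size_t size, int flags, int nodeid);

    /* Example: keep a per-node statistics block on its own node, so
     * the CPUs that update it never cross the interconnect.          */
    struct node_stats {
            unsigned long pagefaults;
            unsigned long interrupts;
    };

    static struct node_stats *alloc_node_stats(int nodeid)
    {
            return kmalloc_node(sizeof(struct node_stats),
                                GFP_KERNEL, nodeid);
    }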
It is important that user programs' memory is allocated on a node
close to the one containing the CPU on which they are running.
Therefore, by default, page faults are satisfied by memory from the
node containing the page-faulting CPU. Because the first CPU to touch
the page will be the CPU that faults the page in, this default policy
is called "first touch".
This functionality is in the 2.4 kernel.
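A sketch of what this default amounts to in the fault path follows.
Here cpu_to_node() and alloc_pages_node() stand for "find my node" and
"allocate from this node's pool"; the exact 2.4 spellings vary by
architecture:

    #include <linux/mm.h>
    #include <linux/smp.h>

    /* Assumed helpers; architectures supply the real mappings: */
    extern int cpu_to_node(int cpu);
    extern struct page *alloc_pages_node(int nid, unsigned int gfp_mask,
                                         unsigned int order);

    /* First touch: satisfy the fault from the faulting CPU's node. */
    static struct page *first_touch_alloc(void)
    {
            int nid = cpu_to_node(smp_processor_id());

            return alloc_pages_node(nid, GFP_HIGHUSER, 0);
    }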
Some applications can gain better performance by explicitly
specifying which node's memory should be used for a given range of the
virtual address space; however, this does not qualify as a "default
action". Please see Bind Memory to
Node(s) for more details on this approach.
Since kernel text is read-only on production systems, there is little
downside to replicating it, placing a copy on each node. This does
consume extra memory, but kernel text is relatively small and memories
of NUMA machines relatively large. Kernel-text replication requires
special handling from debuggers when setting breakpoints and from
/dev/mem in cases where users are sufficiently insane to modify kernel
text on the fly. In addition, kernel-text replication means that
there is no longer a single "well-known" offset between a kernel-text
virtual address and the corresponding physical address.
This functionality is present in some architectures (e.g.,
sgi-ip27) in the 2.4 kernel. Also, the ia64 discontigmem
patch provides kernel text replication support for ia64.
Once loaded, kernel module text is also read-only on production
systems, so, again, there is little performance downside to
replicating it, placing a copy on each node. The tradeoffs are
similar to those for replicating kernel text, with the added
complexity resulting from the need to unload kernel modules--some
boot-time mapping tricks that work for kernel text do not work so
easily for kernel-module text.
On the other hand, there is no "well-known offset" between
kernel-module text virtual and physical addresses, so the only thing
lost by replication is the one-to-one correspondence.
User text is also read-only for production applications (applications
under development may modify text to accommodate breakpoints). The
tradeoffs are similar to those for kernel module text, with the
following additional issues: (1) the aggregate size of user text can
be quite large, (2) it may not make sense to replicate every program's
text across every node, especially for single threaded applications
that run on only one CPU (and thus only one node) at a time, and (3)
programs exit much more frequently than modules are unloaded, so that
efficiently dereplicating user text is much more important than is
efficiently cleaning up after a module unload. It is not clear that
immediate dereplication upon exit() is desirable.
User libraries are yet another example of a mapping that is normally
read-only. The tradeoffs are similar to those for user text; however,
libraries such as libc will be heavily shared. Replication of these
libraries would be expected to give greater performance benefits than
would replication of random user programs.
There may be benefits to unifying the user-text and user-library cases
by simply replicating all read-only user mappings. One good candidate
for replication of read-only mappings is NLS data used for
internationalization/localization.
If a given mapping is writable, but very rarely written, one could
obtain performance benefits by replicating the mapping, and mapping it
read-only. When a write (assumed to be rare) occurred, the kernel would
update all the replicas from the page-fault handler.
It seems unlikely that the complexity and measurement overhead are
justified by the potential performance gains.
Since each node has its own pool of memory, there is motivation to
handle memory pressure separately on the individual nodes. Part of
this would be handled by having a per-node kswapd.
Again, since each node has its own pool of memory, there is motivation
to handle memory pressure separately on the individual nodes. The
other part of this would be handled by having a per-node reclaimd.
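A sketch of how such per-node daemons might be started, modeled on the
way 2.4 starts its single kswapd; the worker and node count are
illustrative:

    #include <linux/sched.h>
    #include <linux/init.h>
    #include <linux/mmzone.h>

    #define NR_NODES 4   /* illustrative */

    /* One reclaim daemon per node, each watching only its own zones. */
    static int kswapd_node(void *arg)
    {
            int nid = (int)(long)arg;

            for (;;) {
                    /* Sleep until node nid is under memory pressure,
                     * then reclaim only from NODE_DATA(nid)'s zones. */
            }
            return 0;
    }

    static int __init start_node_daemons(void)
    {
            int nid;

            for (nid = 0; nid < NR_NODES; nid++)
                    kernel_thread(kswapd_node, (void *)(long)nid,
                                  CLONE_FS | CLONE_FILES | CLONE_SIGNAL);
            return 0;
    }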
Just as a normal SMP scheduler goes to some effort to run processes
and threads on the CPU that they most recently ran on in order to take
advantage of cache state, a NUMA-aware scheduler goes to some effort
to run processes and threads on one of the nodes that they most
recently ran on.
Different architectures are subject to different tradeoffs. For
example, architectures with per-node caches (which cache memory
fetched from other nodes) reward running a process or thread on the
same node that it ran on previously, even if its memory is allocated
on some other node: the per-node caches act in a manner similar to
per-CPU caches. Architectures without per-node caches may offer
little or no performance benefit from running on a CPU on the
previous node when that process's or thread's memory is allocated
on some other node, depending on details of the CPUs' bus protocol.
NUMA-aware schedulers generally group CPUs into nodes. One
approach is to have a goodness penalty for cross-node scheduling, and
another approach is to restrict the search for a CPU to the node
containing the CPU that the process or thread last ran on.
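The goodness-penalty approach can be expressed against the 2.4
scheduler's goodness() calculation as in the sketch below;
task_node(), cpu_to_node(), and the penalty value are assumptions:

    #include <linux/sched.h>

    #define NODE_PENALTY 15   /* illustrative, cf. PROC_CHANGE_PENALTY */

    /* Assumed helpers (goodness() is static in kernel/sched.c): */
    extern int goodness(struct task_struct *p, int this_cpu,
                        struct mm_struct *this_mm);
    extern int cpu_to_node(int cpu);
    extern int task_node(struct task_struct *p); /* node p last ran on */

    static int numa_goodness(struct task_struct *p, int this_cpu,
                             struct mm_struct *this_mm)
    {
            int weight = goodness(p, this_cpu, this_mm);

            /* Crossing nodes forfeits remote-cache and local-memory
             * affinity, so make cross-node selection less attractive. */
            if (task_node(p) != cpu_to_node(this_cpu))
                    weight -= NODE_PENALTY;

            return weight;
    }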
Andrea Arcangeli and Hubertus
Franke have produced patches that bring some degree of
NUMA-awareness to the scheduler.
There may be some kernel subsystems and data structures that need to
be partitioned over nodes. Such subsystems and data structures, if
any, will be identified via benchmarking.
Multipath I/O (MPIO) allows redundant connections between the computer
and its storage. These redundant connections allow greater I/O
bandwidth, greater resilience against failure, and, specifically for
NUMA, DMA of any memory to any disk without the data traversing the
global interconnect. (Note, however, that not all NUMA Unix OSes
implement MPIO. MPIO appears to be most important on architectures
with remote caches that run applications with unpredictable
data-reference patterns.)
FibreChannel is often used to implement MPIO, in which case each
node has at least one connection to a FibreChannel switch, as shown in
the diagram. The large arrow to the left represents the system
interconnect.
There are a couple of patches available that are related to
multipath I/O. Code for some T3-specific
facilities is available only for 2.2.X kernels. These patches add
something called "SCSI ALIASES" to the SCSI mid layer. A user-level
program tells the SCSI layer (via ioctls on the sg driver) that two
or more scsi_device structures point to the same device. Only one
path is active at a time; an alias path is selected in the fail-over
case. The patch appears to retry failed I/Os; it seems to work on
block devices (disks) only. There is no load balancing, since
multiple paths are never active at the same time. Path selection
ignores topology and error status: when the current active path
fails, the next path in the list is simply selected.
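The fail-over policy just described amounts to something like the
following sketch; the structures are illustrative, not the actual
patch:

    #include <stddef.h>

    struct scsi_path {
            struct scsi_path *next;     /* circular list of aliases */
            int               failed;   /* set when I/O errors out  */
    };

    static struct scsi_path *select_path(struct scsi_path *active)
    {
            struct scsi_path *p;

            if (!active->failed)
                    return active;      /* no balancing: stay put   */

            /* On failure, take the next non-failed alias, if any.  */
            for (p = active->next; p != active; p = p->next)
                    if (!p->failed)
                            return p;
            return NULL;                /* all paths are down       */
    }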
Another facility named "md" is available in 2.4. It is actually a
software-RAID device driver that also handles multipathing.
Currently, it does not do any load balancing. Because it is
implemented above the sd driver, it does not handle error status or
topology-based path selection.
Some NUMA systems have hardware features that allow software to track
how often given pages (or other ranges of memory) are referenced by
each node. This information may be used to determine when and where
to migrate pages.
SGI has used this approach, but has not been happy with the
results. The usefulness of this approach likely depends on the
hardware, but thus far, there have been no positive experience reports
with it. (But if you know of any, please do not keep them secret!!!)
It is possible to implement page migration on hardware without
special page counters by using page-level protections. However, this
approach seems much less likely to be beneficial than approaches using
special hardware.
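For completeness, here is a sketch of how protection-based sampling
might work: access to a sampled page is periodically revoked, the
fault path counts which node touched it, and a later pass picks a
migration target. Everything here is illustrative:

    #define MAX_NODES 8   /* illustrative */

    struct page_refs {
            unsigned long count[MAX_NODES];  /* faults seen per node */
    };

    /* Called from the fault path when a sampled page traps. */
    static void note_reference(struct page_refs *r, int faulting_node)
    {
            r->count[faulting_node]++;
    }

    /* Periodically: migrate only if a remote node clearly dominates. */
    static int pick_target_node(struct page_refs *r, int home)
    {
            int nid, best = home;

            for (nid = 0; nid < MAX_NODES; nid++)
                    if (r->count[nid] > 2 * r->count[best]) /* hysteresis */
                            best = nid;
            return best;
    }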
NUMA systems that do not have the hardware features required for page
migration can instead migrate on a process-by-process rather than a
page-by-page basis. Such systems monitor long-term memory and CPU
utilization on each node, and choose processes to migrate in order to
maintain long-term balance of this load. These systems typically
handle shorter-term CPU load imbalances by executing processes on
remote nodes, so that a process's CPU and memory might reside on
different nodes.
This approach has been found useful by Sequent (now IBM). Mileage
may vary depending on hardware architecture: Sequent's systems had
large remote caches and relatively high memory-latency ratios.
Systems without remote caches and with low memory-latency ratios may
or may not see significant benefit--some prototyping and benchmarking
is in order here.
NUMA systems might choose to move processes in order to put them close
to devices that they are doing I/O to. I have not heard of any uses
of this; however, the NUMA systems with which I am most familiar all
have multipath I/O, which allows equal access to all storage I/O from
all nodes. I/O-based migration could potentially be useful on systems
that have node-local I/O devices (e.g., that do not support
FibreChannel storage-area networks).
If you know of cases where I/O-based migration was helpful, or if
you have implemented and/or benchmarked it under Linux, please do not
keep it a secret!
NUMA systems can be subject to lock starvation under high lock contention.
Lock starvation occurs when the contention on a given lock is so high that
by the time a CPU releases the lock, at least one other CPU on that same
node is requesting it again. NUMA latencies mean that these local CPUs
learn that the lock is available earlier than remote CPUs do.
On some architectures, the CPUs on this node can monopolize the lock,
completely starving CPUs on other nodes.
The "right" way to solve this problem is to reduce lock
contention. However, in some cases, high lock contention may be
infrequent (for example, occurring only in presence of rare
hardware-fault conditions) and reducing the contention may require
considerable complexity. In such cases, it may be better to have a
NUMA-aware lock that prevents starvation (and increases
high-contention performance) than to address the infrequent contention
problem.
Sequent and IBM have had good experiences with NUMA-aware locking
in both DYNIX/ptx and AIX.
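One well-known structure for such a lock is hierarchical: each CPU
first acquires its own node's local lock, and only the per-node
winners contend for the global lock, so local hand-offs cannot starve
remote nodes indefinitely. A minimal sketch, with illustrative layout:

    #include <linux/spinlock.h>

    #define MAX_NODES 8   /* illustrative */

    struct numa_lock {
            spinlock_t global;            /* inter-node arbitration */
            spinlock_t local[MAX_NODES];  /* one per node           */
    };

    static void numa_lock_acquire(struct numa_lock *l, int node)
    {
            spin_lock(&l->local[node]);   /* beat local contenders  */
            spin_lock(&l->global);        /* then contend globally  */
    }

    static void numa_lock_release(struct numa_lock *l, int node)
    {
            spin_unlock(&l->global);
            spin_unlock(&l->local[node]);
    }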
This item includes things like an option to the "top" command that
causes it to focus on the activity on a specified node. Commands that
might have such options include "ps", "pstree", and kernel debuggers.
Scientific applications that require high memory bandwidth may benefit
from development tools that identify data structures that require
special placement. Other ideas here? Is this worthwhile?
Tools that monitor system performance (e.g., "sar", "vmstat",
"netstat") could be enhanced to focus on a specified node or to list
out per-node statistics.
Simple Binding
In order for an application to make effective use of simple binding
requests, it must be able to determine how the system is put together.
In particular, it must be able to determine the following:
- Which node a given CPU is located in.
- What memory is located in a given node.
- What I/O adapters are located in a given node.
- Some idea of the "distance" between a specified pair of nodes.
One way to do this is to create a directory hierarchy that reflects
the topology in /proc. Kanoj has produced a patch
that takes this approach.
It is also useful to have a programmatic interface to map between CPUs and nodes.
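A sketch of such an interface as a user-level helper follows; the
/proc/nodeinfo layout and the function itself are hypothetical
(Kanoj's patch defines the real format):

    #include <stdio.h>

    /* Hypothetical: each CPU's node id in /proc/nodeinfo/cpu<N>. */
    int cpu_to_node(int cpu)
    {
            char path[64];
            int node = -1;
            FILE *f;

            snprintf(path, sizeof(path), "/proc/nodeinfo/cpu%d", cpu);
            f = fopen(path, "r");
            if (f) {
                    if (fscanf(f, "%d", &node) != 1)
                            node = -1;   /* unparsable entry */
                    fclose(f);
            }
            return node;                 /* -1: unknown / non-NUMA */
    }

    int main(void)
    {
            printf("cpu 0 is on node %d\n", cpu_to_node(0));
            return 0;
    }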
It may also be useful to enhance procfs and/or devfs to represent
(either user-specified or otherwise) NUMA bindings/associations of
processes, and to represent I/O topology. However, this will require
some thought as well as a definite proposal.
Systems that allow dynamic reconfiguration (e.g., hot-plugging of
memory and nodes) provide motivation for informing NUMA-aware
applications of changes in the topology, in order to allow these
applications to reshape themselves to the new topology.
A first cut at NUMA APIs should probably ignore dynamic
reconfiguration; the eventual notification might take the form of a
signal or callback informing the application of the change.
Some benchmarks and applications can gain higher throughput on both
SMP and NUMA systems by allowing tasks to be bound to specific CPUs.
This binding allows applications to keep specified tasks close to a
given node's memory on NUMA systems and can greatly reduce
cache-bouncing on SMP systems.
The "runon()" API comes close to what is needed, but lacks a
restriction argument. We therefore suggest a bindtocpu()
interface, which binds to a set of CPUs at an arbitrary point in a
process's or thread's execution. To bind at the next fork() or
exec(), launch policies
(similar to those pioneered by HP/UX) should be used so that processes
can be moved from one node to another (if required) at the point where
their state is smallest.
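A sketch of what the suggested bindtocpu() interface might look like;
the signature and CPU-set representation are assumptions for
discussion:

    #include <sys/types.h>

    typedef unsigned long cpuset_t;   /* one bit per CPU, illustrative */

    /* Bind process or thread `pid` (0 == the caller) to the CPUs
     * set in `mask`, effective immediately.                          */
    int bindtocpu(pid_t pid, cpuset_t mask);

    /* Example: pin the caller to CPUs 2 and 3 (one node, say). */
    static int pin_self(void)
    {
            return bindtocpu(0, (1UL << 2) | (1UL << 3));
    }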
Binding tasks to specific CPUs can in some cases unnecessarily leave
CPUs idle. In these cases, it may be better to bind tasks to specific
nodes, allowing any CPU on that node to run the task. Binding a task
to a specified node helps ensure that the task's memory will be
allocated on that node, reducing memory latency. In addition, any
cache bouncing that does occur due to tasks being moved among CPUs is
less harmful when all memory and cachelines are confined to a single
node.
Note that it is cheapest to perform this binding at exec() time,
since at that time, the process has minimal state that needs to be
moved. However, it is sometimes necessary to perform the binding at
fork() time or at an arbitrary point in the process's execution. This
latter capability is especially useful when running online diagnostics
and software that measures hardware bandwidths and latencies.
To bind to a node at an arbitrary point in a given task's
execution, use bindtonode().
To bind at the next fork() or exec(), launch policies (similar to
those pioneered by HP/UX) should be used so that processes can be
moved from one node to another at the point where their state is
smallest.
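Correspondingly, a sketch of bindtonode(); again, the signature is an
assumption:

    #include <sys/types.h>

    /* Bind process or thread `pid` (0 == the caller) to `node`:
     * any CPU on that node may run it, and its memory will by
     * default be allocated there.                                 */
    int bindtonode(pid_t pid, int node);

    /* Example: a diagnostic moves itself to node 1 before timing
     * that node's local memory latency.                           */
    static int move_to_node1(void)
    {
            return bindtonode(0, 1);
    }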
By default, the 2.4 kernel binds a process's memory to the node
containing the CPU that the process is running on at the time of the
page fault (a "first touch" allocation policy). However, applications
that span multiple nodes may need to control what memory is allocated
on which node. Existing NUMA-aware Unix systems have implemented
policies that control where page faults allocate memory from,
including "first touch", "stripe", "round robin", and "most free".
"First touch" allocates the memory on the node containing the
page-faulting CPU. "Stripe" allocates memory from a specified subset
of the nodes based on virtual address. "Round robin" allocates memory
in sequence from a specified subset of the nodes. "Most free"
allocates memory from whichever node currently has the most free
memory.
Of these, "stripe" and "round robin" have been most useful for
commercial database applications. The problem with "first touch" is
that some packages have a single task initialize all of memory, which
results in it being all allocated on a single node. "Most free" has
not seen much use.
This information should be associated with a given range of the
virtual-address space. In the first-cut implementation, it will be
acceptable to restrict each mapping in the virtual-address space to a
single policy. The user can always do multiple mappings, if a finer
policy granularity is desired.
To bind a specific range of memory to a node at an arbitrary point
in a given task's execution, use bindmemory().
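A sketch of what a bindmemory() interface might look like; the policy
names, values, and signature are all assumptions for discussion:

    #include <stddef.h>

    #define POLICY_FIRST_TOUCH  0
    #define POLICY_STRIPE       1
    #define POLICY_ROUND_ROBIN  2

    /* Pages faulted in [addr, addr + len) follow `policy` over the
     * nodes named in `nodemask` (one bit per node).                */
    int bindmemory(void *addr, size_t len, int policy,
                   unsigned long nodemask);

    /* Example: stripe a large shared buffer across nodes 0-3. */
    static int stripe_buffer(void *buf, size_t len)
    {
            return bindmemory(buf, len, POLICY_STRIPE, 0xfUL);
    }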
In some cases, it may make sense to bind a task to an I/O device.
However, in all the architectures that I am aware of, all I/O devices
are associated with a specific node or set of nodes, so that tasks are
bound to nodes rather than to I/O devices. There may be more
motivation for binding tasks to I/O devices in architectures that
allow nodes to be hot-swapped and hot-inserted, or in architectures
that allow subsets of nodes to be connected to a given storage-area
network.
This should not be considered for first-pass NUMA support.
When doing bulk I/O, it can be even more important for the memory to
be on the same node as the device than it is for the task to be
correctly placed. Again, however, I/O devices are located in nodes,
so that the memory can be bound to the node rather than to the device.
And again, there may be more motivation for binding tasks to I/O
devices in architectures that allow nodes to be hot-swapped and
hot-inserted, or in architectures that allow subsets of nodes to be
connected to a given storage-area network.
This should not be considered for first-pass NUMA support.
It is important that the simple-binding capabilities be usable in
scripts, so that non-NUMA-aware applications may take advantage of
NUMA optimizations with few or no changes. Therefore, tools and
commands are needed to do at least the following:
- Print out the machine's topology (this might just be "cat" on
a file in /proc).
- Run a command (and any processes it creates) on a specified CPU
or set of CPUs.
- Run a command (and any processes it creates) on a specified node
(a sketch of such a wrapper appears after this list).
- Determine what binding a specified process is subject to.
This could be a new command and/or a new option to "ps" or "top".
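As an illustration of the node-run item above, here is a sketch of
such a wrapper command: it binds and then execs the real program, so
all children inherit the binding. The bindtonode() call is the
assumed interface sketched earlier, not an existing system call:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>

    extern int bindtonode(pid_t pid, int node); /* assumed wrapper */

    int main(int argc, char *argv[])
    {
            if (argc < 3) {
                    fprintf(stderr, "usage: %s node command [args...]\n",
                            argv[0]);
                    return 1;
            }
            if (bindtonode(0, atoi(argv[1])) != 0) {
                    perror("bindtonode");
                    return 1;
            }
            execvp(argv[2], &argv[2]); /* binding survives the exec */
            perror("execvp");
            return 1;
    }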
Abstract Binding
The general idea behind abstract binding is to allow applications to
specify binding hints or constraints independent of the underlying
topology. The advantages of this are that multiple copies of the same
program can run efficiently on a given machine without any special
awareness of the details of the machine's configuration. Each program
simply asks for the CPU, memory, and scheduling constraints that it
needs, and the kernel and libraries take care of the rest.
Some instances of abstract binding have been tried and proven in
proprietary Unix OSes, while others are more speculative.
These should not be considered for first-pass NUMA support, but
should be energetically investigated for later implementation.
Memory locality domains allow an application to specify how much
resource it needs (e.g., CPUs and memory), and also a desired topology
(e.g., ring or hypercube). The application creates a memory-locality
domain, then specifies which processes are to be run within that
memory-locality domain.
Memory-locality domains have seen heavy use within SGI's Irix OS. See the lse-tech message dated
July 2nd by Paul Jackson for a similar approach adapted to Linux.
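A sketch of what a memory-locality-domain API might look like on
Linux, loosely modeled on IRIX's mld/mldset calls; every name here is
hypothetical:

    #include <sys/types.h>

    typedef int mld_t;   /* handle for a memory-locality domain */

    #define MLD_TOPOLOGY_FREE  0   /* kernel's choice */
    #define MLD_TOPOLOGY_RING  1
    #define MLD_TOPOLOGY_CUBE  2

    /* Create a domain requesting `cpus` CPUs and `pages` pages of
     * memory, laid out according to the topology hint.             */
    mld_t mld_create(int cpus, unsigned long pages, int topology);

    /* Run an existing process (0 == the caller) inside the domain. */
    int mld_attach(mld_t mld, pid_t pid);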
Applications often do not care where a given task is run, as long as
it is run "close" to some other task. This is especially true in
applications that have groups of processes that communicate
intensively with each other. If they are allowed to explicitly
specify that tasks be linked, these applications can gain substantial
performance improvements, again, without the application needing to be
aware of the machine's configuration.
Task-to-task linking has seen heavy use within Sequent's (now
IBM's) DYNIX/ptx OS, particularly
via the attach_proc command and system call and the
QUAD_ATTACH_TO_PARENT flag to qfork() and qexec().
Andi Kleen suggests use of a new CLONE_NODE flag to specify that
the child task be bound to the same node as its parent, and that the
child task should follow the parent if the parent is moved.
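A sketch of how the suggested flag might be used; the flag's value is
invented here, and only the semantics come from the suggestion:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdlib.h>

    #define CLONE_NODE 0x08000000   /* hypothetical flag value */

    static int worker(void *arg)
    {
            /* Runs on -- and stays on -- the parent's node. */
            return 0;
    }

    /* clone() with CLONE_NODE keeps parent and child co-located. */
    static int spawn_colocated(void)
    {
            char *stack = malloc(16384);

            if (!stack)
                    return -1;
            /* The stack grows down on most architectures: pass the top. */
            return clone(worker, stack + 16384,
                         CLONE_VM | CLONE_NODE, NULL);
    }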
If multiple applications are to run simultaneously on a given NUMA
system, it is convenient if the resource names can be virtualized.
This is similar to virtual memory: just as a given virtual address
maps to different physical memory in different tasks, the same virtual
(say) CPU number would map to different physical CPUs in different
applications. This allows all applications to be written as if they
were using (say) CPUs starting with CPU 0.
However, it is important to note that some applications, such as
diagnostics and hardware-performance-measurement tools, will need to
be able to specify physical resources. Such applications typically
understand the machine's configuration, and therefore mean it when
they say "CPU 0".
Virtualizing the resource names requires that a virtual view of
the machine topology be available, in order to allow the application
to determine which of its virtual CPUs map to which of its virtual
nodes.
Linking and binding hints can be thought of as a generalization of a
combination of the Specify
Memory Locality Domains and the Link Tasks Together approaches. The idea is that the
application simply specifies constraints on the placement of its
component parts, and the kernel and libraries dynamically determine
where to put all the components of all the applications so as to
optimize performance.
This is currently thought to be a difficult problem to solve, and
thus may be more in the area of research than of implementation.
Nonetheless, I would love for this thought to be proven wrong!