Name | Email |
Fumio Aono | aono@ed2.com1.fc.nec.co.jp |
Andrea Arcangeli | andrea@suse.de |
Pete Beckman | beckman@turbolabs.com |
Tom Butler | tbutler@hi.com |
Hubertus Franke | frankeh@us.ibm.com |
Jerry Harrow | Jerry.Harrow@Compaq.com |
Bjorn Helgaas | bjorn_helgaas@hp.com |
Andi Kleen | ak@suse.de |
Kouichi Kumon | kumon@flab.fujitsu.co.jp |
Asit Mallick | asit.mallick@intel.com |
Paul E. McKenney | pmckenne@us.ibm.com |
Dave Norton | dnorton@mclinux.com |
Kanoj Sarcar | kanoj@engr.sgi.com |
Sunil Saxena | sunil.saxena@intel.com |
Kimio Suganuma | suganuma@hpc.bs1.fc.nec.co.jp |
Larry Woodman | woodman@missioncriticallinux.com |
John Wright | jwright@engr.sgi.com |
Kanoj Sarcar and John Wright presented SGI's experience with NUMA and Linux and proposed areas of focus for the Linux NUMA effort. They also presented a historical perspective from IRIX.
Kanoj discussed a global page stealer, which looks at all nodes. There are some problems with per-node memory exhaustion.
The scheduler looks at the home node if affinity is exhausted. cpusets are used to get repeatable benchmark runs.
Irix uses a hardware graph to represent the NUMA topology. "NUIA" is shorthand for "non-uniform I/O accesses".
Irix makes heavy use of affinity (also called "binding") for framebuffers and in benchmarks. Andi Kleen expects this affinity will also be useful for networking. There is no compiler-based affinity; instead, user-level tools are used to force affinity manually when the default OS behavior is undesirable. There was some question about how beneficial binding would be, but a number of people felt that it would be especially helpful for direct-connect devices, and somewhat helpful even for switch-attached devices, such as FibreChannel storage-area networks (SANs).
It turns out that Irix replicates only text pages. Andrea suggested making the decision at mmap() time, leveraging the work done in Linux to handle file truncation and mmap.
Irix handles System-V shared memory by using address-range-wide policy to specify characteristics. Migration has not proven useful on Irix workloads, but Larry Woodman pointed out that mileage may vary running different workloads on different systems. In addition, Irix uses page-by-page migration (or larger multipage chunks) with hardware assist, while some other OSes use process-at-a-time migration.
Kanoj does not recommend a hard-and-fast definition of what a node contains, instead allowing CPU-only, memory-only, or even I/O-only nodes.
Kanoj has a patch for spreading bootmem across nodes. This is important in some architectures that have limited amounts of low memory. In such architectures, consuming all the low memory on one of the nodes will cause trouble (e.g., with device drivers) later on.
Kanoj noted that replication requires changes to the kernel debugger, /dev/kmem, and dprobes. For example, if kernel text is replicated, and someone uses /dev/kmem to modify kernel text, the modification must be performed on all the replicas. In addition, allocation failures on a given node must allow guilt-free retries on other nodes. Currently, any sort of failure kicks off kswapd, kreclaimd, and other undesirable global operations. Andrea noted that the required strategy for per-node memory exhaustion may be architecture dependent.
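To make the "guilt-free retry" idea concrete, here is a purely illustrative sketch (not code discussed at the meeting) of node-local allocation with fall-back; this_node(), node_alloc_pages(), and node_try_reclaim() are hypothetical helpers, not existing kernel interfaces:

```c
/*
 * Illustrative sketch only: allocate from the home node first, then try
 * the other nodes, without waking global reclaim on the first failure.
 * struct page is only forward-declared; the helpers below are
 * hypothetical stand-ins for whatever the page allocator would provide.
 */
struct page;

extern int this_node(void);                     /* node of the requesting CPU */
extern struct page *node_alloc_pages(int node, unsigned int gfp_mask,
                                     unsigned int order);
extern void node_try_reclaim(int node, unsigned int gfp_mask,
                             unsigned int order);

struct page *numa_alloc_fallback(unsigned int gfp_mask, unsigned int order,
                                 int nr_nodes)
{
        int home = this_node();
        int i;

        /* Try the home node first, then the others in turn.  A failure
         * on one node must not kick off kswapd/kreclaimd globally. */
        for (i = 0; i < nr_nodes; i++) {
                struct page *page =
                        node_alloc_pages((home + i) % nr_nodes, gfp_mask, order);
                if (page)
                        return page;
        }

        /* Only after every node has failed do we consider reclaim, and
         * ideally only on nodes that can actually help. */
        node_try_reclaim(home, gfp_mask, order);
        return NULL;
}
```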
Text replication makes it impossible to use the current 1-to-1 mapping. In addition, it may be necessary to vary granularity to handle large pages. Per-CPU irq lines are desirable to avoid problems with devices in different nodes wanting to use the same irq. There are potential issues with packet misordering if packets for the same TCP connection are received on different nodes, but Linux already suffers packet misordering because it uses per-CPU interrupt queues. There are fixes to TCP in the works to better handle packet misordering.
Is it reasonable to allow users to make "stupid" requests of the system, for example, to place a given task's memory entirely on one node while allowing it to execute only on CPUs contained in another node? The general consensus was that it is reasonable to allow diagnostic and hardware-performance-measurement software to do this.
Discussion of what sorts of NUMA topologies should be supported resulted in the following issues:
Kanoj noted that kernel placement works very well in absence of dynamic creation and deletion of threads. There are many scientific workloads that run with a known fixed number of threads, but there are others that dynamically create and destroy threads in unpredictable ways.
Andrea Arcangeli described his experience with NUMA on Compaq's Wildfire (DEC Alpha-based) system. This system has 330ns local and 960ns remote memory latency. The general approach is to get the system tuned so that it can make use of the full local-memory bandwidth.
Andrea started the port in March 2000, and was up and running on a GS320 at the end of 2000.
The modified kernel allocates memory from the local node first, then falls back to other nodes. It supports "holes" in the physical address space. It allocates per-node static kernel data structures on the corresponding node. There is an alloc_pages_node() API to dynamically allocate per-node memory. The scheduler needs to understand NUMA penalties; currently there is no change to the goodness() value. Hubertus noted that there is some slowdown due to increased overhead when looking at the local node. Andrea's scheduler changes are all in common code.
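A hedged sketch of how a subsystem might use that interface to keep a buffer's backing pages on a particular node, assuming the alloc_pages_node(node, gfp_mask, order) call described above (exact names and signatures may differ between kernel versions and patch sets):

```c
#include <linux/mm.h>   /* alloc_pages_node(), page_address(), GFP_KERNEL */

/*
 * Allocate 2^order pages on node 'nid' and return their kernel virtual
 * address, or NULL on failure.  The caller decides whether a failure
 * should be retried on another node (see the fall-back sketch earlier).
 */
static void *alloc_buffer_on_node(int nid, unsigned int order)
{
        struct page *page = alloc_pages_node(nid, GFP_KERNEL, order);

        if (!page)
                return NULL;
        return page_address(page);
}
```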
Andrea believes that the kernel should do unconditional kernel text replication. Although there is evidence that 90% of the benefit is gained by replicating a fraction of the code, the amount of memory wasted is too small to worry about. There were problems with lock starvation, and Andrea is considering using a queued lock to address this. Paul offered to start the process of open-sourcing IBM's node-aware locking primitives.
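For illustration of why a queued lock helps with starvation, here is a generic MCS-style queued lock written with C11 atomics for user space; this is not Andrea's actual plan or IBM's primitives. Each waiter spins on a flag in its own queue node, which on NUMA can live in local memory, and the lock is handed over strictly in arrival order:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One qnode per contending CPU or thread. */
struct qnode {
        _Atomic(struct qnode *) next;
        atomic_bool             locked;   /* true while this waiter must spin */
};

struct qlock {
        _Atomic(struct qnode *) tail;     /* NULL when the lock is free */
};

static void qlock_acquire(struct qlock *l, struct qnode *me)
{
        struct qnode *prev;

        atomic_store(&me->next, NULL);
        prev = atomic_exchange(&l->tail, me);     /* join the queue atomically */
        if (prev) {
                atomic_store(&me->locked, true);
                atomic_store(&prev->next, me);    /* link behind the previous waiter */
                while (atomic_load(&me->locked))
                        ;                         /* spin on our own (local) flag */
        }
}

static void qlock_release(struct qlock *l, struct qnode *me)
{
        struct qnode *succ = atomic_load(&me->next);

        if (!succ) {
                struct qnode *expected = me;
                /* No successor visible: try to mark the lock free. */
                if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
                        return;
                /* A successor is arriving; wait for it to link itself in. */
                while (!(succ = atomic_load(&me->next)))
                        ;
        }
        atomic_store(&succ->locked, false);       /* hand the lock over, FIFO order */
}
```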
Other areas needing attention include:
In response to a question about implementing POSIX thread semantics, Andrea said that a user-level implementation was being considered.
Kouichi Kumon presented performance work done on a 2-node (8-CPU) machine with various kernel modifications but with stock apps. The machine had 512MB of memory per node, and was driven by 28 PC client machines. Kouichi looked at per-node memory management and a per-CPU slab cache. See his presentation for many measured results on a number of benchmarks.
Early tests show that up to 80% of the memory traffic goes to node zero, so NUMA tuning is clearly still needed. Some issues and assumptions:
Bjorn Helgaas presented HP's experience with NUMA and the historical perspective from the HP/UX project.
He described HP's Superdome machine, which interleaves memory so as to mask any NUMA effects. Bjorn said that such masking is a firmware-settable option, so that Superdome can act as a NUMA machine if desired.
Bjorn described the HP/UX NUMA API (see slides).
Paul McKenney reviewed the DYNIX/ptx and AIX NUMA APIs, and presented technological trends, showing that SMP optimizations are the first priority. However, NUMA optimizations should be worth a factor of about 1.6 to 4.0 on modern NUMA architectures, and are therefore worth pursuing. Hierarchical NUMA architectures are very likely to start appearing, for example, nodes made up of several multi-CPU chips.
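As a rough, back-of-the-envelope check on that range (my arithmetic, not from the talk), the Wildfire latencies quoted earlier give a remote-to-local ratio of

```latex
\frac{960\,\text{ns}}{330\,\text{ns}} \approx 2.9
```

which sits comfortably inside the 1.6-to-4.0 window.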
Hubertus Franke presented some preliminary results of work introducing NUMA-awareness into the multi-queue scheduler. This work extends the notion of PROC_CHANGE_PENALTY to provide an additional penalty for moving a task from one node to another (an illustrative sketch of this idea appears after the next paragraph).

This was followed by a discussion of the goals, architectures, and priorities of NUMA features, and of NUMA APIs. The stated aim was to come out of the discussion agreeing to work together on -one- NUMA API/implementation for Linux and to agree on the highest-priority NUMA features; breaking out some work items and getting people started on them would be better still.
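Returning to the scheduler work above, here is a purely illustrative sketch of how a node-change penalty could extend the 2.4-era goodness() bonus. NODE_CHANGE_PENALTY, cpu_to_node(), and the specific numbers are assumptions, not part of Hubertus's patch:

```c
#include <linux/sched.h>        /* struct task_struct; p->processor in 2.4 */

#ifndef PROC_CHANGE_PENALTY
#define PROC_CHANGE_PENALTY 15          /* typical 2.4 value on i386 */
#endif
#define NODE_CHANGE_PENALTY 40          /* hypothetical, > PROC_CHANGE_PENALTY */

extern int cpu_to_node(int cpu);        /* hypothetical CPU-to-node mapping */

/*
 * Illustrative only: reward staying on the same CPU the most, staying on
 * the same node somewhat, and give no bonus for a cross-node move, so
 * that moving between nodes effectively pays NODE_CHANGE_PENALTY on top
 * of the usual CPU-change penalty.
 */
static inline int numa_affinity_bonus(struct task_struct *p, int this_cpu)
{
        if (p->processor == this_cpu)
                return PROC_CHANGE_PENALTY + NODE_CHANGE_PENALTY;
        if (cpu_to_node(p->processor) == cpu_to_node(this_cpu))
                return NODE_CHANGE_PENALTY;     /* same node, different CPU */
        return 0;                               /* would have to cross nodes */
}
```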
There was an animated discussion of generalized NUMA APIs, with the goals being to avoid specifying physical IDs and to use a more declarative form of API. One reason that it is important to avoid physical IDs is to allow two applications to share the same machine--if both have been coded to use CPU or node 0, they will conflict, even though there may be sufficient resources to allow both to run comfortably. One motivation for a more declarative NUMA API is to give the OS more ability to optimize situations where multiple NUMA applications are running on the same machine. This discussion was quite interesting, and might lead to a very useful result, but had to be cut short in the interest of time.
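As a sketch of what "declarative, no physical IDs" might mean in practice (entirely hypothetical; nothing like this was proposed or agreed on at the meeting), an application might describe its needs and let the kernel choose the nodes:

```c
/*
 * Entirely hypothetical illustration of a declarative NUMA request: the
 * application states what it needs (threads, memory, how tightly they
 * should be packed) and the kernel maps that onto physical nodes.
 * Neither this structure nor numa_request() exists.
 */
struct numa_request {
        int           nthreads;        /* how many runnable threads */
        unsigned long mem_per_thread;  /* bytes each thread will touch */
        int           pack_tightly;    /* 1: minimize latency, 0: maximize bandwidth */
};

/* Hypothetical interface: because the request names no CPU or node IDs,
 * two such applications can share a machine without both insisting on
 * "node 0", and the kernel is free to place them where resources exist. */
extern int numa_request(struct numa_request *req);
```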
Subsequent discussion divided NUMA functionality into three groups: (1) changes to improve performance on NUMA for non-NUMA-aware applications, (2) changes to provide simple control of resources and execution as would be needed for benchmarks and the like, and (3) changes to allow more general and declarative control of resources and execution. It would be very helpful if the APIs for #2 were a clean subset of those for #3. Making this happen will be a challenge, as there will always be considerable sensitivity to anything perceived as excess complexity.
Finally, there was a discussion on different ways of exposing the system's topology via /proc or a similar interface.