Name | Email |
Fumio Aono | aono@ed2.com1.fc.nec.co.jp |
Andrea Arcangeli | andrea@suse.de |
Pete Beckman | beckman@turbolabs.com |
Tom Butler | tbutler@hi.com |
Hubertus Franke | frankeh@us.ibm.com |
Jerry Harrow | Jerry.Harrow@Compaq.com |
Bjorn Helgaas | bjorn_helgaas@hp.com |
Andi Kleen | ak@suse.de |
Kouichi Kumon | kumon@flab.fujitsu.co.jp |
Asit Mallick | asit.mallick@intel.com |
Paul E. McKenney | pmckenne@us.ibm.com |
Dave Norton | dnorton@mclinux.com |
Kanoj Sarcar | kanoj@engr.sgi.com |
Sunil Saxena | sunil.saxena@intel.com |
Kimio Suganuma | suganuma@hpc.bs1.fc.nec.co.jp |
Larry Woodman | woodman@missioncriticallinux.com |
John Wright | jwright@engr.sgi.com |
Kanoj Sarcar and John Wright presented SGI's experience with NUMA and Linux and proposed areas of focus for the Linux NUMA effort. They also presented a historical perspective from IRIX.
Kanoj discussed a global page stealer, which looks at all nodes. There are some problems with per-node memory exhaustion.
The scheduler looks at the home node if affinity is exhausted. cpusets are used to get repeatable benchmark runs.
Irix uses a hardware graph to represent the NUMA topology. "NUIA" is shorthand for "non-uniform I/O accesses".
Irix makes heavy use of affinity (also called "binding") for framebuffers and in benchmarks. Andi Kleen expects this affinity will also be useful for networking. There is no compiler-based affinity; instead, user-level tools are used to force affinity manually when the default OS behavior is undesirable. There was some question about how beneficial binding would be, but a number of people felt that it would be especially helpful for direct-connect devices, and somewhat helpful even for switch-attached devices, such as FibreChannel storage-area networks (SANs).
It turns out that Irix replicates only text pages. Andrea suggested making the decision at mmap() time, leveraging the work done in Linux to handle file truncation and mmap.
Irix handles System-V shared memory by using address-range-wide policy to specify characteristics. Migration has not proven useful on Irix workloads, but Larry Woodman pointed out that mileage may vary running different workloads on different systems. In addition, Irix uses page-by-page migration (or larger multipage chunks) with hardware assist, while some other OSes use process-at-a-time migration.
Kanoj does not recommend a hard-and-fast definition of what a node contains, instead allowing CPU-only, memory-only, or even I/O-only nodes.
Kanoj has a patch for spreading bootmem across nodes. This is important in some architectures that have limited amounts of low memory. In such architectures, consuming all the low memory on one of the nodes will cause trouble (e.g., with device drivers) later on.
Kanoj noted that replication requires changes to the kernel debugger, /dev/kmem, and dprobes. For example, if kernel text is replicated, and someone uses /dev/kmem to modify kernel text, the modification must be performed on all the replicas. In addition, allocation failures on a given node must allow guilt-free retries on other nodes. Currently, any sort of failure kicks off kswapd, kreclaimd, and other undesirable global operations. Andrea noted that the required strategy for per-node memory exhaustion may be architecture dependent.
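To make the "guilt-free retry" idea concrete, here is a purely illustrative sketch (not code discussed at the meeting) of node-local allocation with fall-back; this_node(), node_alloc_pages(), and node_try_reclaim() are hypothetical helpers, not existing kernel interfaces:

```c
/*
 * Illustrative sketch only: allocate from the home node first, then try
 * the other nodes, without waking global reclaim on the first failure.
 * struct page is only forward-declared; the helpers below are
 * hypothetical stand-ins for whatever the page allocator would provide.
 */
struct page;

extern int this_node(void);                     /* node of the requesting CPU */
extern struct page *node_alloc_pages(int node, unsigned int gfp_mask,
                                     unsigned int order);
extern void node_try_reclaim(int node, unsigned int gfp_mask,
                             unsigned int order);

struct page *numa_alloc_fallback(unsigned int gfp_mask, unsigned int order,
                                 int nr_nodes)
{
        int home = this_node();
        int i;

        /* Try the home node first, then the others in turn.  A failure
         * on one node must not kick off kswapd/kreclaimd globally. */
        for (i = 0; i < nr_nodes; i++) {
                struct page *page =
                        node_alloc_pages((home + i) % nr_nodes, gfp_mask, order);
                if (page)
                        return page;
        }

        /* Only after every node has failed do we consider reclaim, and
         * ideally only on nodes that can actually help. */
        node_try_reclaim(home, gfp_mask, order);
        return NULL;
}
```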
Text replication makes it impossible to use the current 1-to-1 mapping. In addition, it may be necessary to vary granularity to handle large pages. Per-CPU irq lines are desirable to avoid problems with devices in different nodes wanting to use the same irq. There are potential issues with packet misordering if packets for the same TCP connection are received on different nodes, but Linux already suffers packet misordering because it uses per-CPU interrupt queues. There are fixes to TCP in the works to better handle packet misordering.
Is it reasonable to allow users to make "stupid" requests of the system, for example, to place a given task's memory entirely on one node while allowing it to execute only on CPUs contained in another node? The general consensus was that it is reasonable to allow diagnostic and hardware-performance-measurement software to do this.
Discussion of what sorts of NUMA topologies should be supported resulted in the following issues:
Kanoj noted that kernel placement works very well in absence of dynamic creation and deletion of threads. There are many scientific workloads that run with a known fixed number of threads, but there are others that dynamically create and destroy threads in unpredictable ways.
Andrea Arcangeli described his experience with NUMA on Compaq's Wildfire (DEC Alpha-based) system. This system has 330ns local and 960ns remote memory latency. The general approach is to get the system tuned so that it can make use of the full local-memory bandwidth.
Andrea started the port in March 2000, and was up and running on a GS320 at the end of 2000.
The modified kernel allocates memory from the local node first, then falls back to other nodes. It supports "holes" in the physical address space. It allocates per-node static kernel data structures on the corresponding node. There is an alloc_pages_node() API to dynamically allocate per-node memory. The scheduler needs to understand NUMA penalties; currently there is no change to the goodness() value. Hubertus noted that there is some slowdown due to increased overhead when looking at the local node. Andrea's scheduler changes are all in common code.
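A hedged sketch of how a subsystem might use that interface to keep a buffer's backing pages on a particular node, assuming the alloc_pages_node(node, gfp_mask, order) call described above (exact names and signatures may differ between kernel versions and patch sets):

```c
#include <linux/mm.h>   /* alloc_pages_node(), page_address(), GFP_KERNEL */

/*
 * Allocate 2^order pages on node 'nid' and return their kernel virtual
 * address, or NULL on failure.  The caller decides whether a failure
 * should be retried on another node (see the fall-back sketch earlier).
 */
static void *alloc_buffer_on_node(int nid, unsigned int order)
{
        struct page *page = alloc_pages_node(nid, GFP_KERNEL, order);

        if (!page)
                return NULL;
        return page_address(page);
}
```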
Andrea believes that the kernel should do unconditional kernel text replication. Although there is evidence that 90% of the benefit is gained by replicating a fraction of the code, the amount of memory wasted is too small to worry about. There were problems with lock starvation, and Andrea is considering using a queued lock to address this. Paul offered to start the process of open-sourcing IBM's node-aware locking primitives.
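For illustration of why a queued lock helps with starvation, here is a generic MCS-style queued lock written with C11 atomics for user space; this is not Andrea's actual plan or IBM's primitives. Each waiter spins on a flag in its own queue node, which on NUMA can live in local memory, and the lock is handed over strictly in arrival order:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One qnode per contending CPU or thread. */
struct qnode {
        _Atomic(struct qnode *) next;
        atomic_bool             locked;   /* true while this waiter must spin */
};

struct qlock {
        _Atomic(struct qnode *) tail;     /* NULL when the lock is free */
};

static void qlock_acquire(struct qlock *l, struct qnode *me)
{
        struct qnode *prev;

        atomic_store(&me->next, NULL);
        prev = atomic_exchange(&l->tail, me);     /* join the queue atomically */
        if (prev) {
                atomic_store(&me->locked, true);
                atomic_store(&prev->next, me);    /* link behind the previous waiter */
                while (atomic_load(&me->locked))
                        ;                         /* spin on our own (local) flag */
        }
}

static void qlock_release(struct qlock *l, struct qnode *me)
{
        struct qnode *succ = atomic_load(&me->next);

        if (!succ) {
                struct qnode *expected = me;
                /* No successor visible: try to mark the lock free. */
                if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
                        return;
                /* A successor is arriving; wait for it to link itself in. */
                while (!(succ = atomic_load(&me->next)))
                        ;
        }
        atomic_store(&succ->locked, false);       /* hand the lock over, FIFO order */
}
```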
Other areas needing attention include:
In response to a question about implementing POSIX thread semantics, Andrea said that a user-level implementation was being considered.
Kouichi Kumon presented performance work done on a 2-node (8-CPU) machine with various kernel modifications but with stock apps. The machine had 512MB of memory per node, and was driven by 28 PC client machines. Kouichi looked at per-node memory management and a per-CPU slab cache. See his presentation for many measured results on a number of benchmarks.
Early tests show that up to 80% of the memory traffic goes to node zero, so NUMA tuning is clearly still needed. Some issues and assumptions:
Bjorn Helgaas presented HP's experience with NUMA and the historical perspective from the HP/UX project.
He described HP's Superdome machine, which interleaves memory so as to mask any NUMA effects. Bjorn said that such masking is a firmware-settable option, so that Superdome can act as a NUMA machine if desired.
Bjorn described the HP/UX NUMA API (see slides).
Paul McKenney reviewed the DYNIX/ptx and AIX NUMA APIs, and presented technological trends, showing that SMP optimizations are the first priority. However, NUMA optimizations should be worth a factor of about 1.6 to 4.0 on modern NUMA architectures, and are therefore worth pursuing. Hierarchical NUMA architectures are very likely to start appearing, for example, nodes made up of several multi-CPU chips.
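As a rough, back-of-the-envelope check on that range (my arithmetic, not from the talk), the Wildfire latencies quoted earlier give a remote-to-local ratio of

```latex
\frac{960\,\text{ns}}{330\,\text{ns}} \approx 2.9
```

which sits comfortably inside the 1.6-to-4.0 window.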
Hubertus Franke presented some preliminary results of work introducing NUMA-awareness into the multi-queue scheduler. This work extends the notion of PROC_CHANGE_PENALTY to provide an additional penalty for moving a task from one node to another (an illustrative sketch of this idea appears after the next paragraph).

This was followed by a discussion of the goals, architectures, and priorities of NUMA features, and of NUMA APIs. The stated aim was to come out of the discussion agreeing to work together on -one- NUMA API/implementation for Linux and to agree on the highest-priority NUMA features; breaking out some work items and getting people started on them would be better still.
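Returning to the scheduler work above, here is a purely illustrative sketch of how a node-change penalty could extend the 2.4-era goodness() bonus. NODE_CHANGE_PENALTY, cpu_to_node(), and the specific numbers are assumptions, not part of Hubertus's patch:

```c
#include <linux/sched.h>        /* struct task_struct; p->processor in 2.4 */

#ifndef PROC_CHANGE_PENALTY
#define PROC_CHANGE_PENALTY 15          /* typical 2.4 value on i386 */
#endif
#define NODE_CHANGE_PENALTY 40          /* hypothetical, > PROC_CHANGE_PENALTY */

extern int cpu_to_node(int cpu);        /* hypothetical CPU-to-node mapping */

/*
 * Illustrative only: reward staying on the same CPU the most, staying on
 * the same node somewhat, and give no bonus for a cross-node move, so
 * that moving between nodes effectively pays NODE_CHANGE_PENALTY on top
 * of the usual CPU-change penalty.
 */
static inline int numa_affinity_bonus(struct task_struct *p, int this_cpu)
{
        if (p->processor == this_cpu)
                return PROC_CHANGE_PENALTY + NODE_CHANGE_PENALTY;
        if (cpu_to_node(p->processor) == cpu_to_node(this_cpu))
                return NODE_CHANGE_PENALTY;     /* same node, different CPU */
        return 0;                               /* would have to cross nodes */
}
```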
There was an animated discussion of generalized NUMA APIs, with the goals being to avoid specifying physical IDs and to use a more declarative form of API. One reason that it is important to avoid physical IDs is to allow two applications to share the same machine--if both have been coded to use CPU or node 0, they will conflict, even though there may be sufficient resources to allow both to run comfortably. One motivation for a more declarative NUMA API is to give the OS more ability to optimize situations where multiple NUMA applications are running on the same machine. This discussion was quite interesting, and might lead to a very useful result, but had to be cut short in the interest of time.
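As a sketch of what "declarative, no physical IDs" might mean in practice (entirely hypothetical; nothing like this was proposed or agreed on at the meeting), an application might describe its needs and let the kernel choose the nodes:

```c
/*
 * Entirely hypothetical illustration of a declarative NUMA request: the
 * application states what it needs (threads, memory, how tightly they
 * should be packed) and the kernel maps that onto physical nodes.
 * Neither this structure nor numa_request() exists.
 */
struct numa_request {
        int           nthreads;        /* how many runnable threads */
        unsigned long mem_per_thread;  /* bytes each thread will touch */
        int           pack_tightly;    /* 1: minimize latency, 0: maximize bandwidth */
};

/* Hypothetical interface: because the request names no CPU or node IDs,
 * two such applications can share a machine without both insisting on
 * "node 0", and the kernel is free to place them where resources exist. */
extern int numa_request(struct numa_request *req);
```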
Subsequent discussion divided NUMA functionality into three groups: (1) changes to improve performance on NUMA for non-NUMA-aware applications, (2) changes to provide simple control of resources and execution as would be needed for benchmarks and the like, and (3) changes to allow more general and declarative control of resources and execution. It would be very helpful if the APIs for #2 were a clean subset of those for #3. Making this happen will be a challenge, as there will always be considerable sensitivity to anything perceived as excess complexity.
Finally, there was a discussion on different ways of exposing the system's topology via /proc or a similar interface.