WG: Linux on NUMA

July 25, 2001
Ottawa Linux Symposium
Agenda

Discussion on lse-tech@lists.sourceforge.net.

Attendees

Name	Email
Andrea Arcangeli	andrea@suse.de
Hubertus Franke	frankeh@us.ibm.com
Jerry Harrow	Jerry.Harrow@Compaq.com
Andi Kleen	ak@suse.de
Paul E. McKenney	pmckenne@us.ibm.com
Kimio Suganama	suganuma@hpc.bs1.fc.nec.co.jp
Nick Pollitt	npollitt@sgi.com
Gerrit Huizenga	gerrit@us.ibm.com
Paul Dorwin	pdorwin@us.ibm.com
Martin Bligh	mbligh@us.ibm.com
Pat Gaughen	gaughen@us.ibm.com
Ken Rozendal	kenroz@us.ibm.com
And many others...
Let me know who you are and I will add you.

Minutes

Agenda Bashing

Most of this time was consumed getting the LCD projector talking to the laptop, but then again, most of the bashing happened via email prior to the meeting. ;-)

SGI Experience with NUMA and Linux

Nick Pollitt reviewed SGI's proposal (and update) for cpumemsets that Paul Jackson put forward on the lse-tech mailing list.

NUMA-Aware Scheduling

Slides (Adobe PDF)
Slides (PowerPoint)
Slides (StarOffice)

Hubertus Franke described the work that he and the rest of the MQ scheduler team have done using CPU pooling to improve performance on NUMA machines. The idea is to define pools of CPUs that match the NUMA topology of the machine.

NUMA-Aware Locking

Slides (Adobe PDF)
Slides (PowerPoint)
Slides (StarOffice)

Paul McKenney described the work that Jack Vogel and Swaminathan Sivasubramanian have done porting the DYNIX/ptx "jlock" NUMA-aware locking primitives to Linux.

The purpose of NUMA-aware locks is to act as a short-term "guard rail" for locks that might suffer from high contention. In the long term, of course, the only appropriate action is to redesign whatever is causing the high contention. However, in some cases, it can take a lot of time and effort to arrive at an algorithm that is both simple and scalable. In the meantime, NUMA-aware locks can provide some relief from contention and starvation effects.

Lock starvation can occur in many NUMA systems. Consider the system shown in the diagram:
This is a NUMA system with four CPUs per node, CPUs 0, 1, 2, and 3 on Node 0, and CPUs 4, 5, 6, and 7 on Node 1. The most prominent property of NUMA systems is that references to memory or cache lines within a node are faster than references to memory or cache lines from one node to another. Suppose that a lock is contented strongly enough that by the time a given CPU (for example, CPU 0) releases the lock, there is at least one CPU on that same node (for example, CPU 1) spinning on the lock. Because memory references are faster within nodes than between nodes, CPU 1 will learn that the lock is free more quickly than will CPUs 4-7 on the other node. Therefore, CPU 1 will most likely be the next CPU to acquire the lock.

If this high level of contention persists, the lock will always be acquired by one of CPUs 0-3, starving CPUs 4-7. From the perspective of a user with a process running on one of CPUs 4-7, the system will appear to be hung.

This lock starvation is clearly undesirable. Fortunately, the well-known queued locks can prevent starvation, since they grant locks in strict FIFO order. Unfortunately, queued locks are too fair. To see this, suppose that locks are requested by CPUs 0, 4, 1, 5, 2, 6, 3, and 7, in that order. A queued lock would grant the locks in this order, which would result in the lock and all the data that it protects being moved from one node to the other seven times. It would be more efficient to grant the lock to CPUs 0, 1, 2, and 3, then to CPUs 4, 5, 6, and 7, which would move the lock and data between nodes only once, thereby increasing system performance.

Thus, the goal of NUMA-aware locks is to balance fairness and performance. NUMA-aware locks keep the lock on a given node, but for a limited time. The numalock patch achieves this by granting the lock to the next-highest numbered CPU. As long as the CPUs are numbered in node order, this will service all pre-existing requests on a given node before moving to the next node, but will avoid giving the lock to the same CPU twice before advancing to the next CPU.

Jack and Swami have released a patch for numalock. More performance testing is needed.

Andrea Arcangeli noted that numalock adds some instructions to the interrupt fastpath. These instructions prevent the lock from being handed off to a CPU that is currently interrupted. Andrea pointed out that it is entirely possible that the overhead of these extra instructions outweighs the savings due to numalock, particularly on systems with low lock contention, and asked that an additional patch be produced that adds nothing to the interrupt fastpath, in order to allow performance to be compared.

Nick Pollitt later passed on a suggestion from John Hawkes, who would like to see a "fastpath" for high-priority lock requests. The idea would be that normal-priority requests would queue in the normal manner, but that high-priority requests would jump to the front of the queue. Of course, if all requests were high priority, the NUMA-aware nature of the lock would be lost, for example, lock starvation could occur in this case. Therefore, this priority approach is useful only if almost all requests are low priority.

Porting Linux to NUMA-Q

Slides (Adobe PDF)
Slides (PowerPoint)
Slides (StarOffice)

Martin Bligh described his experience porting Linux to a NUMA-Q machine. Linux now boots on a 4-node, 16-CPU NUMA-Q. Challenges included APIC programming and getting the boot sequence right. Interestingly enough, one must set the NUMA-Q firmware into Windows NT mode in order to boot Linux...

Porting Kanoj's Topology-Discovery Patch to NUMA-A

Slides (Adobe PDF)
Slides (PowerPoint)
Slides (StarOffice)

Paul Dorwin described his experience porting Kanoj's patch for /proc topology discovery to NUMA-Q. In addition, Paul described his implementation of an access mechanism to NUMA-Q's hardware performance counters.

Discussion of NUMA Work Items and APIs

Slides (Adobe PDF)
Slides (PowerPoint)
Slides (StarOffice)

We reviewed the work items in the various categories, as shown on the web page.

There was some discussion of replicating file-cache pages, but it turns out that the unified-page-cache design of Linux means that replicating read-only mappings implies replication of file-cache pages. This lead to the realization that the meaning of "read only" is a bit tenuous, since someone might change file permissions on the fly. This in turn seems to imply that it is necessary for write() to handle the case of a replicated file. However, this should be infrequent, so it should not be necessary to handle it efficiently. Therefore, a simple strategy, such as simply collapsing the replicas any time that there is a write, should be sufficient.

The belief that page migration has not been all that useful was revisited and reaffirmed during and after the session. Nick Pollitt noted that SGI has seen some benefit from per-page migration in IRIX on multithreaded processes, but that rather sophisticated heuristics were used. This seems to favor looking at process migration as a first cut, with investigation of process-centric software page migration afterwards.

It was noted that the dynamic reconfiguration version of topology discovery can be put off until such time as Linux does dynzmic reconfiguration (e.g., inserting a new node into a running system without rebooting).

Reaffirmed that binding to I/O devices could be put off, since a valid workaround is to bind to the node containing the I/O devices on all currently known architectures.

Discussed Paul Jackson's cpumemset proposal, and it appears that this proposal addresses the "Specify Memory Locality Domains" and the "Virtualize Resource Names" elements of Abstract Binding, leaving the "Link Tasks Together" and "Linking/Binding Hints" still in need of a proposal.

Reviewed the strawman API posted on SourceForge. A number of issues were noted:

bindmemory() should instead be implemented as an option or set of options to the existing madvise() system call. In addition, this system call needs some indication of "looseness" and "strictness". If a program makes a "strict" call, it is willing to block to get the memory, even if there is memory available on some undesirable node. In contrast, a "loose" call will return memory as long as there is any somewhere in the system.
bindtocpu() and bindtonode() should take a mask as their first argument, in order to allow sets of CPUs and nodes to be specified.
The setlaunch() API interacts with threading in some yet-as-undefined way. One issue is that per-thread as well as per-process launch options would be desireable. A quick review of the HP manual pages revealed that HP-UX had both concepts, so this separation will be added to the Linux API. HP-UX also supports the notion of getting the current launch options, so this will also be considered.
Discussed a /proc interface, including a /proc/pid/nodes_allowed to pin down memory, and a /proc/pid/cpus_owned to run a given task on a selected set of CPUs, excluding all other tasks from these CPUs. (But what if multiple tasks "fight" over a given CPU?)

Paul McKenney will update the NUMA API as noted above, and post it on SourceForge.

Next Meeting

TBD. Will set up via e-mail. In the meantime, let's get lots of real work done!