Discussion on lse-tech@lists.sourceforge.net.
Name | Email |
Andrea Arcangeli | andrea@suse.de |
Hubertus Franke | frankeh@us.ibm.com |
Jerry Harrow | Jerry.Harrow@Compaq.com |
Andi Kleen | ak@suse.de |
Paul E. McKenney | pmckenne@us.ibm.com |
Kimio Suganuma | suganuma@hpc.bs1.fc.nec.co.jp |
Nick Pollitt | npollitt@sgi.com |
Gerrit Huizenga | gerrit@us.ibm.com |
Paul Dorwin | pdorwin@us.ibm.com |
Martin Bligh | mbligh@us.ibm.com |
Pat Gaughen | gaughen@us.ibm.com |
Ken Rozendal | kenroz@us.ibm.com |
And many others... | |
Let me know who you are and I will add you.
Hubertus Franke described the work that he and the rest of the MQ scheduler team have done using CPU pooling to improve performance on NUMA machines. The idea is to define pools of CPUs that match the NUMA topology of the machine.
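To make the idea concrete, here is a minimal sketch of pools matching the two-node example used later in these notes; the names and layout are illustrative, not the MQ scheduler's actual data structures:

```c
/*
 * Hypothetical sketch of CPU pooling for a two-node, eight-CPU
 * machine; illustrative names only, not from the MQ scheduler patches.
 */
struct cpu_pool {
    int node;                   /* NUMA node this pool covers */
    unsigned long cpus;         /* bitmask of member CPUs     */
};

/* One pool per node: CPUs 0-3 on node 0, CPUs 4-7 on node 1. */
static struct cpu_pool pools[] = {
    { .node = 0, .cpus = 0x0fUL },
    { .node = 1, .cpus = 0xf0UL },
};
```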
Paul McKenney described the work that Jack Vogel and Swaminathan Sivasubramanian have done porting the DYNIX/ptx "jlock" NUMA-aware locking primitives to Linux.
The purpose of NUMA-aware locks is to act as a short-term "guard rail" for locks that might suffer from high contention. In the long term, of course, the only appropriate action is to redesign whatever is causing the high contention. However, in some cases, it can take a lot of time and effort to arrive at an algorithm that is both simple and scalable. In the meantime, NUMA-aware locks can provide some relief from contention and starvation effects.
Lock starvation can occur in many NUMA systems. Consider the system shown in the diagram: a NUMA system with four CPUs per node, CPUs 0, 1, 2, and 3 on Node 0, and CPUs 4, 5, 6, and 7 on Node 1. The most prominent property of NUMA systems is that references to memory or cache lines within a node are faster than references to memory or cache lines from one node to another.
Suppose that a lock is contended strongly enough that by the time a given CPU (for example, CPU 0) releases the lock, there is at least one CPU on that same node (for example, CPU 1) spinning on the lock. Because memory references are faster within nodes than between nodes, CPU 1 will learn that the lock is free more quickly than will CPUs 4-7 on the other node. Therefore, CPU 1 will most likely be the next CPU to acquire the lock.
If this high level of contention persists, the lock will always be acquired by one of CPUs 0-3, starving CPUs 4-7. From the perspective of a user with a process running on one of CPUs 4-7, the system will appear to be hung.
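To see the mechanics behind this, here is a minimal sketch of the kind of conventional test-and-test-and-set spinlock that exhibits the problem (a user-space illustration with C11 atomics, not the actual kernel code):

```c
#include <stdatomic.h>

/*
 * Minimal user-space sketch of a test-and-test-and-set spinlock.
 * All spinning CPUs poll the same cache line, so when the holder
 * releases the lock, the news reaches CPUs on the holder's own node
 * first, and one of them almost always wins the race to re-acquire.
 */
typedef struct {
    atomic_int locked;          /* 0 = free, 1 = held */
} naive_spinlock_t;

static void naive_lock(naive_spinlock_t *l)
{
    for (;;) {
        /* Spin read-only until the lock looks free... */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;                   /* cheap spin on the local cache copy */
        /* ...then race to grab it; node-local CPUs see the release
         * sooner and therefore usually get here first. */
        if (!atomic_exchange_explicit(&l->locked, 1,
                                      memory_order_acquire))
            return;
    }
}

static void naive_unlock(naive_spinlock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```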
This lock starvation is clearly undesirable. Fortunately, the well-known queued locks can prevent starvation, since they grant locks in strict FIFO order. Unfortunately, queued locks are too fair. To see this, suppose that locks are requested by CPUs 0, 4, 1, 5, 2, 6, 3, and 7, in that order. A queued lock would grant the locks in this order, which would result in the lock and all the data that it protects being moved from one node to the other seven times. It would be more efficient to grant the lock to CPUs 0, 1, 2, and 3, then to CPUs 4, 5, 6, and 7, which would move the lock and data between nodes only once, thereby increasing system performance.
Thus, the goal of NUMA-aware locks is to balance fairness and performance. NUMA-aware locks keep the lock on a given node, but only for a limited time. The numalock patch achieves this by granting the lock to the next-highest-numbered waiting CPU, wrapping around to CPU 0. As long as the CPUs are numbered in node order, this services all pre-existing requests on a given node before moving to the next node, but avoids granting the lock to the same CPU twice while other CPUs are waiting.
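Here is a hedged sketch of that handoff policy; the names, the waiter-mask representation, and the simplified synchronization are illustrative and are not taken from the numalock patch itself:

```c
#include <stdatomic.h>

#define NR_CPUS 8

/*
 * Each waiter sets its bit in a mask, and the releasing CPU hands the
 * lock to the next-highest-numbered waiter, wrapping around.  With
 * CPUs numbered in node order, this drains one node's waiters before
 * the lock migrates to the next node, yet never grants the lock to
 * the same CPU twice while others wait.
 */
struct numa_lock {
    atomic_uint waiters;        /* bit n set: CPU n is waiting     */
    atomic_int  owner;          /* CPU holding the lock, or -1     */
};

/* Find the next waiter at or above cpu+1, wrapping to CPU 0. */
static int next_waiter(unsigned mask, int cpu)
{
    for (int i = 1; i <= NR_CPUS; i++) {
        int candidate = (cpu + i) % NR_CPUS;
        if (mask & (1u << candidate))
            return candidate;
    }
    return -1;                  /* no waiters: the lock goes free */
}

static void numa_lock_acquire(struct numa_lock *l, int cpu)
{
    atomic_fetch_or(&l->waiters, 1u << cpu);
    int free = -1;
    /* Either grab a free lock, or spin until a releaser names us. */
    while (!atomic_compare_exchange_weak(&l->owner, &free, cpu) &&
           atomic_load(&l->owner) != cpu)
        free = -1;              /* reset expected value and retry */
    atomic_fetch_and(&l->waiters, ~(1u << cpu));
}

static void numa_lock_release(struct numa_lock *l, int cpu)
{
    atomic_store(&l->owner,
                 next_waiter(atomic_load(&l->waiters), cpu));
}
```

With CPUs 0-3 on node 0 and 4-7 on node 1, a release by CPU 0 hands the lock to CPU 1 if it is waiting, and the lock crosses to node 1 only after the node-0 waiters have drained.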
Jack and Swami have released a patch for numalock. More performance testing is needed.
Andrea Arcangeli noted that numalock adds some instructions to the interrupt fastpath. These instructions prevent the lock from being handed off to a CPU that is currently interrupted. Andrea pointed out that it is entirely possible that the overhead of these extra instructions outweighs the savings due to numalock, particularly on systems with low lock contention, and asked that an additional patch be produced that adds nothing to the interrupt fastpath, in order to allow performance to be compared.
Nick Pollitt later passed on a suggestion from John Hawkes, who would like to see a "fastpath" for high-priority lock requests. The idea is that normal-priority requests would queue in the normal manner, but high-priority requests would jump to the front of the queue. Of course, if all requests were high priority, the NUMA-aware nature of the lock would be lost; for example, lock starvation could again occur. Therefore, this priority approach is useful only if almost all requests are low priority.
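One possible way to express this suggestion, reusing the illustrative next_waiter() from the sketch above (again, invented names, not an implementation of the actual proposal):

```c
/*
 * Illustrative sketch only.  High-priority waiters register in their
 * own mask, which the release path checks before the normal one,
 * letting them jump the queue.
 */
struct prio_numa_lock {
    atomic_uint hi_waiters;     /* high-priority spinners        */
    atomic_uint waiters;        /* normal-priority spinners      */
    atomic_int  owner;
};

static int pick_next(struct prio_numa_lock *l, int cpu)
{
    unsigned hi = atomic_load(&l->hi_waiters);
    if (hi)                     /* high priority jumps the queue */
        return next_waiter(hi, cpu);
    return next_waiter(atomic_load(&l->waiters), cpu);
}
```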
Martin Bligh described his experience porting Linux to a NUMA-Q machine. Linux now boots on a 4-node, 16-CPU NUMA-Q. Challenges included APIC programming and getting the boot sequence right. Interestingly enough, one must set the NUMA-Q firmware into Windows NT mode in order to boot Linux...
Paul Dorwin described his experience porting Kanoj's patch for /proc topology discovery to NUMA-Q. In addition, Paul described his implementation of an access mechanism to NUMA-Q's hardware performance counters.
We reviewed the work items in the various categories, as shown on the web page.
There was some discussion of replicating file-cache pages, but it turns out that the unified-page-cache design of Linux means that replicating read-only mappings implies replicating file-cache pages. This led to the realization that the meaning of "read only" is a bit tenuous, since someone might change file permissions on the fly. This in turn seems to imply that write() must handle the case of a replicated file. However, this should be infrequent, so it need not be handled efficiently. Therefore, a simple strategy, such as collapsing the replicas any time there is a write, should be sufficient.
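A hedged pseudocode sketch of that simple strategy, with invented names rather than real kernel interfaces, might look like this:

```c
#include <stddef.h>

#define MAX_NODES 2

struct page;                                        /* opaque here  */
void free_replica(struct page *p);                  /* hypothetical */
void copy_into_page(struct page *p, const void *buf,
                    size_t len, size_t off);        /* hypothetical */

/*
 * Reads may be served from per-node replicas, but a write first
 * collapses every replica back to the single master page, keeping
 * the rare write path simple.
 */
struct replicated_page {
    struct page *master;                /* authoritative copy      */
    struct page *replica[MAX_NODES];    /* per-node copies or NULL */
};

static void collapse_replicas(struct replicated_page *rp)
{
    for (int node = 0; node < MAX_NODES; node++) {
        if (rp->replica[node]) {
            free_replica(rp->replica[node]);
            rp->replica[node] = NULL;
        }
    }
}

static void write_to_page(struct replicated_page *rp,
                          const void *buf, size_t len, size_t off)
{
    collapse_replicas(rp);      /* infrequent, so simplicity wins  */
    copy_into_page(rp->master, buf, len, off);
}
```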
The belief that page migration has not been all that useful was revisited and reaffirmed during and after the session. Nick Pollitt noted that SGI has seen some benefit from per-page migration in IRIX on multithreaded processes, but that rather sophisticated heuristics were used. This seems to favor looking at process migration as a first cut, with investigation of process-centric software page migration afterwards.
It was noted that the dynamic reconfiguration version of topology discovery can be put off until such time as Linux supports dynamic reconfiguration (e.g., inserting a new node into a running system without rebooting).
Reaffirmed that binding to I/O devices could be put off, since a valid workaround is to bind to the node containing the I/O devices on all currently known architectures.
Discussed Paul Jackson's cpumemset proposal, and it appears that this proposal addresses the "Specify Memory Locality Domains" and the "Virtualize Resource Names" elements of Abstract Binding, leaving the "Link Tasks Together" and "Linking/Binding Hints" still in need of a proposal.
Reviewed the strawman API posted on SourceForge. A number of issues were noted: