To: Kanoj Sarcar
cc: andi@suse.de, andrea@suse.de, aono@ed2.com1.fc.nec.co.jp,
    beckman@turbolabs.com, bjorn_helgaas@hp.com,
    Hubertus Franke/Watson/IBM@IBMUS, Jerry.Harrow@Compaq.com,
    jwright@engr.sgi.com, kanoj@engr.sgi.com, kumon@flab.fujitsu.co.jp,
    norton@mclinux.com, suganuma@hpc.bs1.fc.nec.co.jp,
    sunil.saxena@intel.com, tbutler@hi.com,
    woodman@missioncriticallinux.com, mikek@sequent.com,
    Kenneth Rozendal/Austin/IBM
Date: 03/29/01 11:59 AM
From: Paul McKenney/Beaverton/IBM@IBMUS
Subject: Re: your mail (NUMA-on-Linux roadmap)

Kanoj,

Thank you for looking this over!  I will send out an update based on the
comments and presentations that I have received.  Questions and comments
are interspersed below.

> > o Memory-allocation policies (e.g., DISCONTIGMEM, first-touch
> >   allocation, node-specific page allocation, node-specific bootmem
> >   allocation, read-only mapping replication, per-node kmalloc(),
> >   per-node kswapd/kreclaimd).  Status: mostly done in 2.4.
>
> Just to be clear: _kernel_ read-only replication has been implemented
> on some architectures.  Per-node kmalloc() has not been, and volunteers
> in this area are needed.  We need the option of architecture code to
> ask for per-node kswapd/kreclaimd (do not assume all NUMA machines
> will want per-node daemons; some might want a daemon for four nodes,
> etc.), and this is 2.5 work.

Thank you for the specifics, a great improvement over my "mostly".  ;-)

There is some possibility of hierarchical NUMA architectures in which
some memories have lower latency to each other than to other memories
(for example, a conventional NUMA system with a tree-shaped
interconnect).  If we decide to support this sort of architecture, there
would be some incentive to modify the memory allocator so that, if there
is no memory on the desired node, the other nodes are searched in
increasing order of distance from the requesting task (a rough sketch
appears below, after the multipath item).  But this is clearly less
urgent than the items you called out, so I am marking it "moderate"
rather than "high" urgency.  If someone has an architecture that really
needs this soon, please speak up!

> > o Hierarchical scheduling policies.  Status: some patches available.
> >   Urgency: high.
>
> What is hierarchical scheduling?  Handling multilevel caches, nested
> NUMA, etc.?  If so, as I mentioned, some progress has been made in the
> sense that Linus acknowledges the issue and is open to a per-arch
> scheduling solution.  I.e., goodness() will be replaced with
> arch_goodness(), etc.

This is the long-term goal we need to reach, but the first step will be
to get simple "flat" NUMA awareness into the scheduler.  I have broken
this out in my doc.

> > o Multipath I/O.  Status: efforts in progress.  Urgency: low to
> >   moderate, depending on architecture.
>
> It wasn't clear to me what the relationship between multipath I/O and
> NUMA is.  True, in case a user program wants to allocate memory near a
> device, it has more choices if the device is multipath.  Are there
> other relationships I am missing?

This is a big part of it.  Another part is that you want the kernel to
choose the device that is "closest" to the memory that is being DMAed
from or to, which may be a different device than the one that is
"closest" to the task requesting the I/O.
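Here is the rough sketch of the fallback-by-distance allocation promised
above.  None of these names exist today: nodes_by_distance[][] would
presumably be built by platform code from the inter-node distance
information, try_alloc_page_on_node() would be a non-blocking attempt on
a single node, and the real thing would of course live inside the page
allocator rather than beside it.

/*
 * Sketch only: allocate a page as close as possible to home_node.
 * nodes_by_distance[home][k] is assumed to list all numnodes nodes
 * in increasing order of distance from "home".
 */
struct page *alloc_page_near_node(int home_node)
{
        int k;
        struct page *page;

        for (k = 0; k < numnodes; k++) {
                page = try_alloc_page_on_node(nodes_by_distance[home_node][k]);
                if (page != NULL)
                        return page;    /* closest node with free memory */
        }
        return NULL;                    /* all nodes exhausted */
}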
> > o Support for nested NUMA (e.g., nodes composed of several multi-CPU
> >   chips).  Status: not yet started.  Urgency: moderate, depending
> >   on architecture.
>
> Only scheduling optimizations here, right?

Scheduling optimizations are the most important item on all
architectures, and they are the only thing that matters on the
multi-CPU-chip architectures we discussed.  However, architectures with
tree-shaped interconnects can also benefit from hierarchical-NUMA-aware
memory allocation: if out of memory on the desired node, one would want
to look for memory on the closest nodes first.  That said, I believe
that the scheduling optimizations are the more important of the two in
this case.  (Thoughts?)

> > o Support for non-tree topologies.  There were some comments about
> >   this possibility yesterday.  Does anyone wish to stand up and be
> >   counted?  It does add some complexity, so do we really want to
> >   support it?
>
> Hmm, IRIX's /hw, i.e., hwgfs, is a graph (hardware-graph file system).
> This is because it represents the interconnect between nodes.  I
> remember graphs and trees being mentioned while talking about /proc
> extensions; is that what you are talking about here?

I was thinking of machines that are arranged in ways that cannot
reasonably be represented by a tree.  For a (far-fetched) example,
consider a mythical variant of the Intel Paragon that was CC-NUMA rather
than MPP.  Such a machine would have an xy-grid interconnect with nodes
at the intersections and traffic routed along the grid.  My personal
preference would be to ignore the possibility of this sort of machine.
Anyone willing to speak up on behalf of such machines?

Any documentation on /hw?  How does it compare to the various /proc/node
proposals that people were making at the end of the meeting on Monday?
If you can send me the docs or a pointer to them, I would be happy to
put them up on SourceForge so that everyone can look them over.

> > Category 2: Facilities to bind tasks and memory to CPUs and nodes
> >
> > o Topology discovery (see below for some approaches).
> >   Status: not yet started.  Urgency: high.
>
> I suspect this is completely architecture/platform specific.  The
> generic NUMA code will expect a distance matrix distances[N][N], which
> the platform code has to populate.  Device-node distance is rarely
> asked for, so it is okay to have something like a procedure call, which
> you talk about later.

We may be talking past one another here.  I would consider SGI's /hw
facility to be an example of a topology-discovery mechanism (assuming my
guesses about how it works are correct).

> > o Binding tasks to CPUs or groups of CPUs (export runon/cpuset
> >   functionality).  Status: some patches available.  Urgency: high.
> >
> > o Binding tasks and related groups of tasks to nodes or groups
> >   of nodes.  Status: not yet started.  Urgency: high.
> >
> > o Binding memory to nodes or groups of nodes.  Status: not yet
> >   started.  Urgency: high.
>
> I think I covered some reasons why SGI thinks the above are needed.
> Basically, soft partitioning (to reduce the impact of one set of
> programs on another) and more repeatability in performance.  Do others
> have other reasons for these features?

Two more reasons come to mind:

o Hand-tuning of performance-critical applications: placing the
  components of the application so that the components doing the most
  communication are on the same quad.

o Performance-measurement code that measures the latencies and
  bandwidths between pairs of CPUs and/or nodes (see the strawman
  below).
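As a strawman for the second item, here is the sort of user-level
measurement loop that binding primitives would enable.  The two binding
calls are placeholders for whatever interface we end up with -- they do
not exist today -- and the loop makes no attempt to outwit prefetch
hardware, so take the numbers it produces with a grain of salt.

#include <stdlib.h>
#include <sys/time.h>

/* Placeholders for the (not-yet-existing) binding interfaces. */
extern int bind_self_to_cpu(int cpu);
extern void *alloc_memory_on_node(int node, size_t len);

#define STRIDE  128             /* at least a cache line apart */
#define NACCESS (1 << 20)

/* Rough average read latency (microseconds) from "cpu" to node "node". */
double read_latency(int cpu, int node, size_t len)
{
        volatile char *buf;
        struct timeval t1, t2;
        size_t i, off = 0;
        char sink = 0;

        if (bind_self_to_cpu(cpu) != 0)
                return -1.0;
        if ((buf = alloc_memory_on_node(node, len)) == NULL)
                return -1.0;
        gettimeofday(&t1, NULL);
        for (i = 0; i < NACCESS; i++) {
                sink += buf[off];               /* forced volatile read */
                off = (off + STRIDE) % len;
        }
        gettimeofday(&t2, NULL);
        (void)sink;
        return ((t2.tv_sec - t1.tv_sec) * 1000000.0 +
                (t2.tv_usec - t1.tv_usec)) / NACCESS;
}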
> Both repeatability and partitioning can be addressed by nodesets
> (which was an advanced option, as I mentioned on the slide, but I never
> got around to talking about it), which I _think_ is easier to
> understand.  Of course, cpusets/memory-placement gives more detailed
> control.  Whether NUMA uses cpusets will also be dictated at least
> partly by whether cpusets become part of SMP kernels.

I agree strongly on the benefits of cpusets and nodesets.  I believe we
should implement them, and this is what I was referring to by "groups of
nodes".  Other thoughts?

> > o General specification of "hints" that indicate relationships
> >   between different objects, including tasks, a given task's
> >   data/bss/stack/heap ("process" to the proprietary-Unix people
> >   among us), a given shm/mmap area, a given device, a given CPU,
> >   a given node, and a given IPC conduit.  Status: needs definition,
> >   research, prototyping, and experimentation.  Urgency: hopefully
> >   reasonably low, given the complexities that would be involved if
> >   not done correctly.
>
> I remember we talked about kernel-to-user copying, but we decided it
> was a non-issue, right?  The thread could just ask to be close to the
> device.

Agreed!

> I also remember a thread wanting to be close to an IPC conduit; was the
> example that two threads were communicating via pipes, and both should
> be near the kernel-maintained buffer for the pipe?  Or was it something
> else?  I thought we decided this could be handled by asking the kernel
> to schedule the threads on the same node, so that (hopefully) the
> kernel buffer would be on the same node.

Although this does work, it has the undesirable effect of preventing
migration (on systems that benefit from migration).  If instead the pair
of tasks can be specified as "linked", the kernel is free to migrate
tasks as needed to handle overload situations, while still keeping the
pair of tasks close to each other.

> Is there a reason other than the above for two threads to want to be on
> the same cpu (I know IBM has the concept of sharing cache affinity
> between multiple threads)?  Or on the same node (forget node-level
> caches, local snoops, etc.)?

Those of us working on machines with node-level caches and/or local
snoops are not likely to forget them -- performance measurements will
remind us of them if we try.  ;-)

A workload with a reasonably small number of non-CPU-bound tasks that
communicate heavily via some sort of IPC or shared memory will see
impressive performance gains if the related threads are scheduled on the
same CPU or on the same node.  Especially in the IPC case, if there are
kernel buffers involved, or if the tasks' memory is being allocated
node-local, these gains appear even in the absence of node-level caches
and local snoops.

> > So where do you get the distance information from?  I believe that
> > this must be supplied by architecture-specific functions: on some
> > systems, the firmware might supply this information, while on others,
> > the architecture-specific Linux code will likely have to make
> > assumptions based on the system structure.  Would an API like the
> > following be appropriate?
> >
> >         int numa_distance(char *obj1, char *obj2)
> >
> > where obj1 and obj2 are the names of the CPUs or memories for which
> > distance information is desired.
>
> Much simpler to have a matrix populated at boot time.  (Let's not worry
> right now about hot plug; we can move to a procedure call if hot plug
> demands it.)

How about a macro that provides a "wrapper" for the matrix?  There would
be no performance loss in this case, and the code accessing the matrix
would not need to change should hot plug require that the matrix change
at runtime.
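For concreteness, I am picturing something as simple as the following
(names invented for the sake of discussion):

/* Filled in by architecture/platform code at boot time. */
extern int numa_distances[MAXNUMNODES][MAXNUMNODES];

/*
 * All generic code goes through this wrapper, so if hot plug ever
 * forces the matrix to be replaced by a procedure call, only this
 * one line changes.
 */
#define node_distance(from, to)  (numa_distances[(from)][(to)])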
> > All these proposals leave open the question of how to represent
> > distance information from memory and CPU to devices.
>
> Chances are, we are not going to use device distances.  If the user
> wants to allocate memory near device D1, the kernel will look at the
> device handle (what this will end up being is an open issue, i.e., is
> it an ascii name, a devfs handle, a pci_dev *, etc.), figure out the
> controlling/master node (with something like
> master_node(void *device_handle)), and try to allocate memory on that
> node (falling back on the next-best node based on inter-node
> distances).  Similarly for a user wanting to be on a cpu close to the
> device.
>
> If it really came to needing device distances, I would suggest
> something like
>
>         device_distance(void *device_handle, char *kvaddr)
>
> to find the distance from the device to some piece of memory.

This would be needed for multipath I/O, to figure out which of a number
of devices would be best able to DMA from a given piece of memory.

> This raises an interesting thing that I am taking for granted.  All
> ccNUMA machines use the same bus/network/switches for PIO, DMA, and
> internode memory-transfer paths.  So the PIO distance from a cpu to a
> device would be the same as the DMA distance from the containing node
> to the device, which in turn equals the memory distance from any cpu on
> the containing node to the master node of the device.  Is there an
> architecture that does not fit into this simplified picture?

All commercially successful machines that I am aware of fit into this
picture.  However, if things like InfiniBand are successful, this may
change.  One way to deal with this sort of situation would be to have a
facility for computing distances to devices; another would be to use
multipath-I/O techniques.  Thoughts?

> > Can we simplify this?  In most architectures I have dealt with, you
> > can do pretty well just by counting "hops" through nodes or
> > interconnects.
>
> Yes, that was my expectation too.  Platform code is perfectly welcome
> to set distance[i][j] = 1.6 for 1 hop, 4.7 for 2 hops, etc., though.
> For most, 1 for 1 hop, 2 for 2 hops would work fine, I suspect.

In these cases, it would be helpful to have a mapping from hops to
distance, perhaps something like hops_to_dist[i] giving the distance
corresponding to the given number of hops.  And if I were writing an app
that cared about the detailed performance of the machine, I might be
tempted to just measure it!  Thoughts?

Again, thank you for your careful review of the document!

						Thanx, Paul
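P.S.  For the hops-to-distance mapping, I am picturing something along
the following lines (all names are again invented, node_hops() being a
hypothetical platform routine returning the hop count between two
nodes):

/* Example values only: index is the hop count, 0 meaning "local". */
static int hops_to_dist[] = { 0, 1, 2, 3 };

void init_numa_distances(void)
{
        int i, j;

        for (i = 0; i < numnodes; i++)
                for (j = 0; j < numnodes; j++)
                        numa_distances[i][j] =
                                hops_to_dist[node_hops(i, j)];
}

Platforms wanting finer-grained numbers (your 1.6-vs-4.7 example,
suitably scaled to integers) could instead fill numa_distances[][]
directly.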