To: Paul McKenney/Beaverton/IBM@IBMUS
cc: andi@suse.de, andrea@suse.de, aono@ed2.com1.fc.nec.co.jp, beckman@turbolabs.com, bjorn_helgaas@hp.com, Hubertus Franke/Watson/IBM@IBMUS, Jerry.Harrow@Compaq.com, jwright@engr.sgi.com, kanoj@engr.sgi.com, kumon@flab.fujitsu.co.jp, norton@mclinux.com, suganuma@hpc.bs1.fc.nec.co.jp, sunil.saxena@intel.com, tbutler@hi.com, woodman@missioncriticallinux.com, mikek@sequent.com, Kenneth Rozendal/Austin/IBM@IBMUS
Date: 03/30/01 01:33 PM
From: Kanoj Sarcar
Subject: Re: your mail (NUMA-on-Linux roadmap)
> There is some possibility of hierarchical NUMA architectures where
> some memories have lower latency to each other than to other memories
> (for example, consider a conventional NUMA system with a tree-shaped
> interconnect). If we decide to support this sort of architecture,
> there would be some incentive to modify the memory allocation so
> that if there is no memory on the desired node, then the other nodes
> are searched in increasing order of distance from the requesting task.
Exactly. That is what distance[i][j] will be used for. Of course, you
can argue that instead we could have something like
for (i = 1; i < max_radius; i++)
        for_each_node_at_distance(targetnode, i) {
                /* try to allocate a page on this node; stop when we get one */
                ...
        }
where we pass in a target best node, and we iterate over longer and
longer distances, identifying nodes, and trying to allocate a page,
stopping when we get something.
I am assuming a construct like for_each_node_at_distance() or
distance[i][j] will _completely_ relieve generic code from looking
at specific topologies. Does anyone see why not?
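To make the distance[i][j] flavor concrete, here is a rough sketch of
what the generic fallback path could look like. Everything below other
than the distance[][] idea itself (numnodes, try_alloc_on_node(),
NR_NODES, MAX_DISTANCE) is a placeholder I am making up for
illustration, and the double loop is obviously not tuned:

#include <stddef.h>

#define NR_NODES        8       /* placeholder */
#define MAX_DISTANCE    255     /* assumed upper bound on table entries */

extern int numnodes;                            /* nodes actually present */
extern int distance[NR_NODES][NR_NODES];        /* filled in by platform code */
extern void *try_alloc_on_node(int node);       /* hypothetical per-node allocator */

void *numa_alloc_near(int targetnode)
{
        int d, node;
        void *page;

        /* Walk the nodes in increasing distance from the preferred node. */
        for (d = 0; d <= MAX_DISTANCE; d++)
                for (node = 0; node < numnodes; node++) {
                        if (distance[targetnode][node] != d)
                                continue;
                        page = try_alloc_on_node(node);
                        if (page)
                                return page;
                }
        return NULL;    /* out of memory everywhere */
}

Generic code only ever reads the table; whether the numbers came from a
tree, a grid or a firmware table is the platform's business.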
> > > o Multipath I/O. Status: efforts in progress. Urgency: low to moderate,
> > > depending on architecture.
> >
> > It wasn't clear to me what the relationship between Multipath I/O
> > and NUMA is. True, in case a user program wants to allocate memory near
> > a device, he has more choices if the device is multipath. Are there other
> > relationships I am missing?
>
> This is a big part of it. Another part is that you want the kernel to
> choose the device that is "closest" to the memory that is being DMAed
> from/to. This may be a different device than the one that is "closest"
> to the task requesting the I/O.
Yes, okay. I will keep this at the back of my mind, although there's
a whole chunk of orthogonal work needed before this becomes an issue.
I agree with your prioritization on this; in fact, I would downgrade
it to Low.
>
> > > o Support for nested NUMA (e.g., nodes composed of several multi-CPU
> > > chips). Status: Not yet started. Urgency: moderate, depending
> > > on architecture.
> >
> > Only scheduling optimizations here, right?
>
> Scheduling optimizations are the most important in all architectures, and
> are the only thing that matters on the multi-CPU-chip architectures we
> discussed. However, architectures with tree-shaped interconnects can
> benefit from hierarchical-NUMA-aware memory allocation, also: if out
> of memory on the desired node, one would want to look for memory on the
> closest nodes first. That said, I believe that the scheduling optimizations
> are most important in this case. (Thoughts?)
Anything other than the hard work of tuning involved here?
for_each_node_at_distance() or distance[i][j] should handle CMP (chip
multiprocessor) and hierarchical/nested NUMA fine. Right?
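For instance, with made-up numbers, a box with two physical nodes each
holding two CMP chips could simply report four "nodes" and a table like
the sketch below; generic code still only walks the table, and the
platform decides how many levels of hierarchy the numbers encode:

/*
 * Made-up example: four scheduling "nodes", i.e. two CMP chips per
 * physical node.  Chips 0 and 1 share one node, chips 2 and 3 the
 * other; on-chip is cheapest, same node next, remote node worst.
 */
int distance[4][4] = {
        { 0, 1, 2, 2 },
        { 1, 0, 2, 2 },
        { 2, 2, 0, 1 },
        { 2, 2, 1, 0 },
};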
>
> > > o Support for non-tree topologies. There were some comments about
> > > this possibility yesterday. Does anyone wish to stand up and be
> > > counted? It does add some complexity, so do we really want to
> > > support it?
> >
> > Hmm, IRIX's /hw ie hwgfs is a graph (hardware-graph file system). This
> > is because it represents the interconnect between nodes. I remember
> > graphs and trees being mentioned while talking about /proc extensions,
> > is that what you are talking about here?
>
> I was thinking of machines that are arranged in ways that cannot be
> represented reasonably with a tree. For a (far-fetched) example, consider
> a mythical variant of the Intel Paragon machine that was a CC-NUMA rather
> than an MPP. This machine has an xy grid interconnect, with nodes at the
> intersections. Traffic is routed along the xy grid.
>
> My personal preference would be to ignore the possibility of this sort
> of machine. Anyone willing to speak up on their behalf?
>
Same as above. for_each_node_at_distance() or distance[i][j] ...
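Even the xy-grid machine reduces to the platform filling the table
with, say, Manhattan hop counts at boot, so the generic side never
finds out it is a grid. Rough sketch, where node_x()/node_y() are
made-up helpers and distance[][]/numnodes are the same placeholders as
in the earlier sketch:

/* Hypothetical platform init for an xy-grid interconnect. */
extern int node_x(int node);    /* made-up: grid coordinates of a node */
extern int node_y(int node);

void grid_init_distance(void)
{
        int i, j, dx, dy;

        for (i = 0; i < numnodes; i++)
                for (j = 0; j < numnodes; j++) {
                        dx = node_x(i) - node_x(j);
                        dy = node_y(i) - node_y(j);
                        /* Manhattan distance along the xy routing grid. */
                        distance[i][j] = (dx < 0 ? -dx : dx) +
                                         (dy < 0 ? -dy : dy);
                }
}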
>
> > > o Binding tasks to CPUs or groups of CPUs (export runon/cpuset
> > > functionality). Status: some patches available. Urgency: high.
> > >
> > > o Binding tasks and related groups of tasks to nodes or groups
> > > of nodes. Status: not yet started. Urgency: high.
> > >
> > > o Binding memory to nodes or groups of nodes. Status: not yet started.
> > > Urgency: high.
> >
> > I think I covered some reasons why SGI thinks the above are needed.
> > Basically, soft partitioning (to reduce impact of one set of programs
> > on another), and more repeatability in performance. Do others have other
> > reasons for these features?
>
Bear with me throughout my arguments. It will be easier if you think
like someone who knows what NUMA is, but has no experience tuning
NUMA systems. To convince him that something needs to be done, you
either provide sound logic ... or, in its absence, the best you can do
is suggest that experience shows certain things help on NUMA.
> Hand-tuning of performance-critical applications: get the components of
> the app placed so that the components doing the most communication are
> on the same quad. Performance-measurement code that measures the latencies
> and bandwidths between pairs of CPUs and/or nodes.
I know I suggested the latency/bandwidth thing, but in this case,
theoretically, all you need is to tell the kernel to put your thread
and memory at a certain distance from each other, and let the kernel
choose the actual locations. I will still use this logic, but of course,
someone might throw this back at me.
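Just to illustrate what I mean (this is not a proposal, and the names
below are invented), such an interface could be as vague as the
following, which never names a node at all:

#include <stddef.h>

/*
 * Purely invented interface, stubbed out so the sketch compiles: ask
 * the kernel to keep the calling thread and this memory range within
 * 'maxdist' of each other, and let the kernel pick the actual nodes.
 */
static int numa_distance_hint(void *addr, size_t len, int maxdist)
{
        (void)addr; (void)len; (void)maxdist;
        return 0;               /* a real kernel would act on the hint */
}

static void example(char *buf, size_t buflen)
{
        /* "keep this buffer within one hop of wherever I run" */
        numa_distance_hint(buf, buflen, 1);
}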
Diagnostic/firmware tools are still the only obvious example for precise
placement. Partitioning and its benefits come next. Performance tools last.
>
> > I also remember thread wanting to be close to IPC conduit, was the example
> > that two threads were communicating via pipes, and both should be near the
> > kernel maintained buffer for the pipe? Or was it something else? I thought
> > we decided this could be handled by asking the kernel to schedule the
> > threads on the same node, so that (hopefully) the kernel buffer would be
> > on the same node.
>
> Although this does work, it has the undesirable effect of preventing migration
> (on systems that benefit from migration). By specifying that a pair of
> tasks should be "linked", the kernel is free to migrate tasks as needed
> to handle overload situations, while still keeping the pair of tasks close
> to each other.
Could you explain how this _should_ work? AFAICS, the kernel needs to
be told that 2 threads need to be near each other. The kernel does put
them near each other. The ipc kernel buffers get allocated near both of
them. Then the kernel needs to migrate some processes. Since the threads
have been "linked" before, they are migrated and relocated somewhere
else, near each other again. What happens to the ipc kernel buffer? Who
relocates that? Internally, does the kernel also try to migrate the
objects that each of the relocated processes is using?
>
> > Is there a reason other than the above for two threads to be wanting to
> > be on the same cpu (I know IBM has the concept of sharing cache affinity
> > between multiple threads)? Or being on the same node (forget node level
> > caches, local snoops etc)?
>
> Those of us working on machines with node-level caches and/or local snoops
> are not likely to forget them -- performance measurements will remind us
> of them if we try to do so. ;-)
I was saying "forget" in the sense that I already know about this;
what other reasons are there?
> How about a macro that provides a "wrapper" for the matrix? No performance
> loss in this case, and the code accessing the matrix would not need to
> change should hot plug require that the matrix change at runtime.
Sure.
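Something along these lines should do it; the names below are just
placeholders:

#define NR_NODES        8       /* placeholder */

/*
 * Today this can stay a flat static table, and the macro costs nothing.
 * If hot plug ever needs the values to change at runtime, only this
 * wrapper has to change (e.g. into a function call), not its callers.
 */
extern int __node_distance[NR_NODES][NR_NODES];
#define node_distance(from, to) (__node_distance[(from)][(to)])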
> > > Can we simplify this? In most architectures I have dealt with, you can
> > > do pretty well just by counting "hops" through nodes or interconnects.
> >
> > Yes, that was my expectation too. Platform code is perfectly welcome to
> > set distance[i][j] = 1.6 for 1 hop, 4.7 for 2 hops etc though. For most,
> > 1 for 1 hop, 2 for 2 hops would work fine, I suspect.
>
> In these cases, it would be helpful to have a mapping from hops to distance,
> perhaps something like hops_to_dist[i] giving the distance corresponding to
> the given number of hops. If I was writing an app that cared about the
> detailed performance of the machine, I might be tempted to just measure it!
> Thoughts?
Hmm, whatever made you think the value of hops_to_dist[i] would just
depend on i? It could very well depend on the two end nodes involved
in the hop. :-) Or more appropriately, on the characteristics of switches
and routers involved in the path.
I think having a single distance vector is a good starting point. We can
refine it if an architecture needs it later.
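So a platform that really is uniform per hop can trivially derive the
whole table from hop counts, while one that is not can fill in per-pair
values directly, and generic code never knows the difference. Rough
sketch of the uniform case (num_hops() is a made-up helper,
distance[][]/numnodes are the same placeholders as before, and the
numbers just echo the scaled 1.6/4.7 example above):

/* Hypothetical init for a platform that really is uniform per hop. */
extern int num_hops(int from, int to);          /* made-up platform helper */

static int hops_to_dist[] = { 16, 47, 100 };    /* made-up values for 1, 2, 3 hops */

void hops_init_distance(void)
{
        int i, j;

        for (i = 0; i < numnodes; i++)
                for (j = 0; j < numnodes; j++)
                        distance[i][j] = (i == j) ? 0
                                        : hops_to_dist[num_hops(i, j) - 1];
}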
Kanoj