To: Kanoj Sarcar
cc: andi@suse.de, andrea@suse.de, aono@ed2.com1.fc.nec.co.jp, beckman@turbolabs.com, bjorn_helgaas@hp.com, Hubertus Franke/Watson/IBM@IBMUS, Jerry.Harrow@Compaq.com, jwright@engr.sgi.com, kanoj@engr.sgi.com, Kenneth Rozendal/Austin/IBM@IBMUS, kumon@flab.fujitsu.co.jp, mikek@sequent.com, norton@mclinux.com, suganuma@hpc.bs1.fc.nec.co.jp, sunil.saxena@intel.com, tbutler@hi.com, woodman@missioncriticallinux.com
Date: 03/30/01 11:24 PM
From: Paul McKenney/Beaverton/IBM@IBMUS
Subject: Re: your mail (NUMA-on-Linux roadmap)
> > There is some possibility of hierarchical NUMA architectures where
> > some memories have lower latency to each other than to other memories
> > (for example, consider a conventional NUMA system with a tree-shaped
> > interconnect). If we decide to support this sort of architecture,
> > there would be some incentive to modify the memory allocation so
> > that if there is no memory on the desired node, then the other nodes
> > are searched in increasing order of distance from the requesting task.
>
> Exactly. That is what distance[i][j] will be used for. Of course, you
> can argue that instead we could have something like
>
> for (i = 1; i < max_radius; i++)
>     for_each_node_at_distance(targetnode, i)
>         ...
>
> where we pass in a target best node, and we iterate over longer and
> longer distances, identifying nodes, and trying to allocate a page,
> stopping when we get something.
>
> I am assuming a construct like for_each_node_at_distance() or
> distance[i][j] will _completely_ relieve generic code from looking
> at specific topologies. Anyone sees why not?
Answering this will take some thought. Here are the operations that I
believe will be needed (I am using "diameter" in the graph-theoretic
sense here):
1. Find the distance between a pair of nodes.
2. Given a node, find the closest other node.
3. Given a node, find the closest other node that has some resource
or attribute (e.g., has more than a specified amount of memory
available).
4. Find the largest clique of nodes of diameter less than some
specified distance (but maybe you have a better way of satisfying
the constraints of needing more than some number of CPUs "close"
to each other).
5. Ditto, but some nodes are excluded from consideration
(e.g., because they are being used by other applications,
don't have enough memory, etc.).
6. Find the set of nodes of the specified size that has the
smallest diameter.
7. Ditto, but some nodes are excluded from consideration.
8. Given a set of nodes, find the additional node that results
in the smallest diameter of the resulting augmented set.
This would be needed when expanding resources available to
an application.
Are there others? Are there some that I have listed that really
are not needed? Other thoughts/approaches to determine what data
is needed and/or most convenient?
> > > > o Multipath I/O. Status: efforts in progress. Urgency: low to
> > > >   moderate, depending on architecture.
> > >
> > > It wasn't clear to me what the relationship between Multipath I/O
> > > and NUMA is. True, in case a user program wants to allocate memory near
> > > a device, he has more choices if the device is multipath. Are there other
> > > relationships I am missing?
> >
> > This is a big part of it. Another part is that you want the kernel to
> > choose the device that is "closest" to the memory that is being DMAed
> > from/to. This may be a different device than the one that is "closest"
> > to the task requesting the I/O.
>
> Yes, okay. I will keep this at the back of my mind, although there's
> a whole chunk of orthogonal work that is needed, before this will become
> an issue. I agree with your prioritization on this; in fact, I would
> downgrade it to Low.
At least some of the orthogonal driver-related work is being worked on.
In the worst case, all of the MPIO code could be put into specialized
drivers, but it would be nice for common code to be in the kernel.
But those of us with architectures that are helped by MPIO must be
prepared to deal with the possibility that little or no MPIO code is
accepted into the kernel!
> > > > o Support for nested NUMA (e.g., nodes composed of several multi-CPU
> > > > chips). Status: Not yet started. Urgency: moderate, depending
> > > > on architecture.
> > >
> > > Only scheduling optimizations here, right?
> >
> > Scheduling optimizations are the most important in all architectures, and
> > are the only thing that matters on the multi-CPU-chip architectures we
> > discussed. However, architectures with tree-shaped interconnects can
> > benefit from hierarchical-NUMA-aware memory allocation, also: if out
> > of memory on the desired node, one would want to look for memory on the
> > closest nodes first. That said, I believe that the scheduling
> > optimizations are most important in this case. (Thoughts?)
>
> Anything other than the hard work of tuning involved here?
> for_each_node_at_distance() or distance[i][j] should handle CMP (chip
> multiprocessor) and hierarchical/nested NUMA fine. Right?
Seems plausible, but we do need to make sure we understand all the
operations that need to be done based on distance. I have been surprised
too many times in my life, what can I say? ;-)
> > > > o Support for non-tree topologies. There were some comments about
> > > > this possibility yesterday. Does anyone wish to stand up and be
> > > > counted? It does add some complexity, so do we really want to
> > > > support it?
> > >
> > > Hmm, IRIX's /hw, i.e. hwgfs, is a graph (hardware-graph file system). This
> > > is because it represents the interconnect between nodes. I remember
> > > graphs and trees being mentioned while talking about /proc extensions,
> > > is that what you are talking about here?
> >
> > I was thinking of machines that are arranged in ways that cannot be
> > represented reasonably with a tree. For a (far-fetched) example, consider
> > a mythical variant of the Intel Paragon machine that was a CC-NUMA rather
> > than an MPP. This machine has an xy grid interconnect, with nodes at the
> > intersections. Traffic is routed along the xy grid.
> >
> > My personal preference would be to ignore the possibility of this sort
> > of machine. Anyone willing to speak up on their behalf?
> >
>
> Same as above. for_each_node_at_distance() or distance[i][j] ...
Again, sounds plausible. What about potential machines that have memory,
CPUs, and/or I/O devices segregated? Should we ignore this possibility?
> > > > o Binding tasks to CPUs or groups of CPUs (export runon/cpuset
> > > > functionality). Status: some patches available. Urgency: high.
> > > >
> > > > o Binding tasks and related groups of tasks to nodes or groups
> > > > of nodes. Status: not yet started. Urgency: high.
> > > >
> > > > o Binding memory to nodes or groups of nodes. Status: not yet started.
> > > > Urgency: high.
> > >
> > > I think I covered some reasons why SGI thinks the above are needed.
> > > Basically, soft partitioning (to reduce impact of one set of programs
> > > on another), and more repeatability in performance. Do others have other
> > > reasons for these features?
>
> Bear with me throughout my arguments. It will be easier if you think
> like someone who knows what NUMA is, but has no experience in tuning
> NUMA systems. To convince him something needs to be done, you either
> provide sound logic ... or in its absence, the best you can do is
> suggest that experience dictates doing certain things helps NUMA.
Excellent point, it will be necessary to explain all this to people
who do not have deep experience working with NUMA.
> > Hand-tuning of performance-critical applications: get the components of
> > the app placed so that the components doing the most communication are
> > on the same quad. Performance-measurement code that measures the latencies
> > and bandwidths between pairs of CPUs and/or nodes.
>
> I know I suggested the latency/bandwidth thing, but in this case,
> theoretically, all you need is to tell the kernel to put your thread
> and memory at a certain distance from each other, and let the kernel
> choose the actual locations. I will still use this logic, but of course,
> someone might throw this back at me.
>
> Diagnostic/firmware tools are still the only obvious example for precise
> placement. Partitioning and its benefits come next. Performance tools last.
Benchmarking in cases that the scheduling heuristics handle poorly.
(Of course, for people doing benchmarks, a 5% performance deficit can
qualify as "handle poorly".) This has been the most frequent use of
hand-placing/binding things in ptx. It turns out that most real customer
applications are complex and dynamic enough that hand tuning does not help
much. Benchmarks, however, -are- often helped by hand tuning, and there
are some (not many, but some) real customer apps that are helped by hand
tuning. In addition, hand tuning/placement was sometimes used to work
around performance bugs (e.g., poor heuristics) while they were being
fixed.
> > > I also remember thread wanting to be close to IPC conduit, was the
> > > example that two threads were communicating via pipes, and both should
> > > be near the kernel maintained buffer for the pipe? Or was it something
> > > else? I thought we decided this could be handled by asking the kernel
> > > to schedule the threads on the same node, so that (hopefully) the
> > > kernel buffer would be on the same node.
> >
> > Although this does work, it has the undesirable effect of preventing
> > migration (on systems that benefit from migration). By specifying that a pair of
> > tasks should be "linked", the kernel is free to migrate tasks as needed
> > to handle overload situations, while still keeping the pair of tasks close
> > to each other.
>
> Could you explain how this _should_ work? AFAICS, the kernel needs to
> be told that 2 threads need to be near each other. The kernel does put
> them near each other. The ipc kernel buffers get allocated near both of
> them. Then the kernel needs to migrate some processes. Since the threads
> have been "linked" before, they are migrated and relocated somewhere
> else, near each other again. What happens to the ipc kernel buffer? Who
> relocates that? Internally, does the kernel also try to migrate the
> objects that each of the relocated processes is using?
In ptx, there are two cases: 1) there is a fixed per-connection IPC
buffer, which will be small enough to reside comfortably in the per-node
cache, and 2) there is no fixed buffer (think TCP), in which case, the
STREAMS buffers/mbufs will be allocated "close" to the two threads.
In the second case, there may be a short interval during which the
STREAMS buffers/mbufs are in the wrong place, but once these are
consumed, all is well. One could of course imagine migrating a fixed
IPC buffer, but this seems like something one tunes if workloads
demonstrate that it is valuable.
> > > Is there a reason other than the above for two threads to be wanting to
> > > be on the same cpu (I know IBM has the concept of sharing cache affinity
> > > between multiple threads)? Or being on the same node (forget node level
> > > caches, local snoops etc)?
> >
> > Those of us working on machines with node-level caches and/or local snoops
> > are not likely to forget them -- performance measurements will remind us
> > of them if we try to do so. ;-)
>
> I was saying "forget" in the sense that I already know about this, what
> other reasons are there?
The only other reason I can come up with off the top of my head is the
case where memory migrates along with the threads, so that keeping the
threads on the same node allows the shared memory to be migrated to
the node that both the threads are running on.
> > > > Can we simplify this? In most architectures I have dealt with, you can
> > > > do pretty well just by counting "hops" through nodes or interconnects.
> > >
> > > Yes, that was my expectation too. Platform code is perfectly welcome to
> > > set distance[i][j] = 1.6 for 1 hop, 4.7 for 2 hops etc though. For most,
> > > 1 for 1 hop, 2 for 2 hops would work fine, I suspect.
> >
> > In these cases, it would be helpful to have a mapping from hops to distance,
> > perhaps something like hops_to_dist[i] giving the distance corresponding to
> > the given number of hops. If I was writing an app that cared about the
> > detailed performance of the machine, I might be tempted to just measure it!
> > Thoughts?
>
> Hmm, whatever made you think the value of hops_to_dist[i] would just
> depend on i? It could very well depend on the two end nodes involved
> in the hop. :-) Or more appropriately, on the characteristics of switches
> and routers involved in the path.
I agree that hops_to_dist[i] would very likely depend on all sorts of
things, including the amount of competing traffic along the path.
The thing that I am curious about is whether an application using
such a simple (and thus less accurate) view of the machine could
do almost as well as an application using a more detailed view.
> I think having a single distance vector is a good starting point. We can
> refine it, if an architecture needs it later.
My reaction would be to look at the code needed to perform the eight
operations listed near the beginning of this message (give or take any
that turn out to be unneeded, plus any I didn't think to include).
Has anyone already looked into this?
Thanx, Paul