To: Paul McKenney/Beaverton/IBM@IBMUS
cc: andi@suse.de, andrea@suse.de, aono@ed2.com1.fc.nec.co.jp, beckman@turbolabs.com, bjorn_helgaas@hp.com, Hubertus Franke/Watson/IBM@IBMUS, Jerry.Harrow@Compaq.com, jwright@engr.sgi.com, kanoj@engr.sgi.com, Kenneth Rozendal/Austin/IBM@IBMUS, kumon@flab.fujitsu.co.jp, mikek@sequent.com, norton@mclinux.com, suganuma@hpc.bs1.fc.nec.co.jp, sunil.saxena@intel.com, tbutler@hi.com, woodman@missioncriticallinux.com
Date: 03/31/01 11:37 AM
From: Kanoj Sarcar
Subject: Re: your mail (NUMA-on-Linux roadmap)

> > > There is some possibility of hierarchical NUMA architectures where
> > > some memories have lower latency to each other than to other memories
> > > (for example, consider a conventional NUMA system with a tree-shaped
> > > interconnect). If we decide to support this sort of architecture,
> > > there would be some incentive to modify the memory allocation so
> > > that if there is no memory on the desired node, then the other nodes
> > > are searched in increasing order of distance from the requesting task.
> >
> > Exactly. That is what distance[i][j] will be used for. Of course, you
> > can argue that instead we could have something like
> >
> >	for (i = 1; i < max_radius; i++)
> >		for_each_node_at_distance(targetnode, i)
> >			...
> >
> > where we pass in a target best node, and we iterate over longer and
> > longer distances, identifying nodes, and trying to allocate a page,
> > stopping when we get something.
> >
> > I am assuming a construct like for_each_node_at_distance() or
> > distance[i][j] will _completely_ relieve generic code from looking
> > at specific topologies. Anyone see why not?
>
> Answering this will take some thought. Here are the operations that I
> believe will be needed (I am using "diameter" in the graph-theoretic
> sense here):
>
> 1. Find the distance between a pair of nodes.
> 2. Given a node, find the closest other node.
> 3. Given a node, find the closest other node that has some resource or
>    attribute (e.g., has more than a specified amount of memory available).
> 4. Find the largest clique of nodes of diameter less than some specified
>    distance (but maybe you have a better way of satisfying the constraints
>    of needing more than some number of CPUs "close" to each other).
> 5. Ditto, but some nodes are excluded from consideration (e.g., because
>    they are being used by other applications, don't have enough memory,
>    etc.).
> 6. Find the set of nodes of the specified size that has the smallest
>    diameter.
> 7. Ditto, but some nodes are excluded from consideration.
> 8. Given a set of nodes, find the additional node that results in the
>    smallest diameter of the resulting augmented set. This would be needed
>    when expanding resources available to an application.

Note that distance[i][j] will trivially solve 1 and 2. True, you can
optimize how you do 2 by having an ordered list of nodes for each node,
but we can get to that once we have a rough implementation ... if it
becomes obvious that operation 2 is frequent and needs to be optimized.

To do 3, you need something like does_node_have_resource(i), and the most
common use of this can be handled via a per-node mem_avail[] array
(possibly). (Oh, maybe I have not mentioned this, but there is currently
a way to go from cpu number to node number.)

The rest of the operations might not be used as-is ... i.e., the query may
have to be modified a bit.
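To make 1 through 3 concrete, here is roughly the kind of helper I would
expect to sit on top of distance[i][j] and a per-node mem_avail[] array.
This is only a sketch: closest_node(), closest_node_with_mem(), numnodes
and MAX_NUMNODES are made-up names for illustration, not a proposed
interface.

/* Illustrative only: the platform-dependent code fills these in. */
#define MAX_NUMNODES	32
extern int numnodes;				/* nodes actually present */
extern int distance[MAX_NUMNODES][MAX_NUMNODES];
extern unsigned long mem_avail[MAX_NUMNODES];	/* free pages per node */

/* Operation 1: distance between a pair of nodes is a table lookup. */
#define node_distance(i, j)	(distance[(i)][(j)])

/* Operation 2: given a node, find the closest other node. */
static int closest_node(int node)
{
	int i, best = -1;

	for (i = 0; i < numnodes; i++) {
		if (i == node)
			continue;
		if (best < 0 ||
		    node_distance(node, i) < node_distance(node, best))
			best = i;
	}
	return best;
}

/* Operation 3: closest other node with at least "pages" pages free. */
static int closest_node_with_mem(int node, unsigned long pages)
{
	int i, best = -1;

	for (i = 0; i < numnodes; i++) {
		if (i == node || mem_avail[i] < pages)
			continue;
		if (best < 0 ||
		    node_distance(node, i) < node_distance(node, best))
			best = i;
	}
	return best;
}

The ordered-list optimization for operation 2 would just replace the
linear scan with a precomputed per-node list of nodes sorted by distance.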
In any case, distance[i][j] will still solve these, and yes, there may be
better ways of doing it than using distance[i][j].

Btw, funny that you bring up node exclusion ... because that is what
cpusets/memorysets/nodesets do implicitly; in which case, you actually
start with a set of nodes from the global set of nodes, and you do the
search on those. I guess that would be the delete-before-search approach,
rather than the search-and-delete approach that you seem to imply. And
just one last thing ... in your operations, "close" could be replaced by
"a specific topology, like cube, or sphere" ...

Given how quickly we have homed in on trying to parse and apply selector
functions on the topology, I think the best thing for the generic code to
do is assume that all such interfaces will be in topology.h and provided
by the platform-dependent code (I have put a strawman sketch of such an
interface below).

> > Same as above. for_each_node_at_distance() or distance[i][j] ...
>
> Again, sounds plausible. What about potential machines that have memory,
> CPUs, and/or I/O devices segregated? Should we ignore this possibility?

What extra problems do you think this kind of machine opens up (other than
possibly a lot more cpu-only and memory-only nodes)? I am not sure I
understand what you mean by segregated here ... As always, the trick is in
mapping your hardware to the conceptual numa node (cpus/memory/devices).
This mapping might not always follow board-level packaging ... obvious
enough for nested-numa.
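Here is the strawman topology.h I mentioned above. Every name in it is
invented purely for illustration; the point is only that the platform code
supplies the distance and cpu-to-node primitives, and that the
delete-before-search idea falls out naturally once the caller hands in an
allowed-node mask (derived from its cpuset/memoryset) that the search
never steps outside of.

/*
 * Strawman <asm/topology.h>: every name is illustrative, and the
 * platform-dependent code would supply the real implementations.
 */
#define MAX_NUMNODES	32

extern int __node_distance(int from, int to);
#define node_distance(from, to)	__node_distance((from), (to))

extern int __cpu_to_node(int cpu);	/* cpu number -> node number */
#define cpu_to_node(cpu)	__cpu_to_node(cpu)

/*
 * Delete-before-search: "allowed" is a bitmask of the nodes the caller
 * may use (e.g. derived from its nodeset), so excluded nodes never enter
 * the search at all.
 */
static inline int best_allowed_node(int target, unsigned long allowed)
{
	int node, best = -1;

	for (node = 0; node < MAX_NUMNODES; node++) {
		if (!(allowed & (1UL << node)))
			continue;
		if (best < 0 ||
		    node_distance(target, node) <
		    node_distance(target, best))
			best = node;
	}
	return best;
}

Generic code would then only ever include topology.h, and never look at
the actual interconnect.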
> > > > I also remember thread wanting to be close to IPC conduit, was the
> > > > example that two threads were communicating via pipes, and both
> > > > should be near the kernel maintained buffer for the pipe? Or was it
> > > > something else? I thought we decided this could be handled by asking
> > > > the kernel to schedule the threads on the same node, so that
> > > > (hopefully) the kernel buffer would be on the same node.
> > >
> > > Although this does work, it has the undesirable effect of preventing
> > > migration (on systems that benefit from migration). By specifying that
> > > a pair of tasks should be "linked", the kernel is free to migrate
> > > tasks as needed to handle overload situations, while still keeping the
> > > pair of tasks close to each other.
> >
> > Could you explain how this _should_ work? AFAICS, the kernel needs to
> > be told that 2 threads need to be near each other. The kernel does put
> > them near each other. The ipc kernel buffers get allocated near both of
> > them. Then the kernel needs to migrate some processes. Since the threads
> > have been "linked" before, they are migrated and relocated somewhere
> > else, near each other again. What happens to the ipc kernel buffer? Who
> > relocates that? Internally, does the kernel also try to migrate the
> > objects that each of the relocated processes is using?
>
> In ptx, there are two cases: 1) there is a fixed per-connection IPC
> buffer, which will be small enough to reside comfortably in the per-node
> cache, and 2) there is no fixed buffer (think TCP), in which case, the
> STREAMS buffers/mbufs will be allocated "close" to the two threads.
> In the second case, there may be a short interval during which the
> STREAMS buffers/mbufs are in the wrong place, but once these are
> consumed, all is well. One could of course imagine migrating a fixed
> IPC buffer, but this seems like something one tunes if workloads
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> demonstrate that it is valuable.

> > > > Is there a reason other than the above for two threads to be wanting
> > > > to be on the same cpu (I know IBM has the concept of sharing cache
> > > > affinity between multiple threads)? Or being on the same node
> > > > (forget node-level caches, local snoops etc)?
> > >
> > > Those of us working on machines with node-level caches and/or local
> > > snoops are not likely to forget them -- performance measurements will
> > > remind us of them if we try to do so. ;-)
> >
> > I was saying "forget" in the sense that I already know about this, what
> > other reasons are there?
>
> The only other reason I can come up with off the top of my head is the
> case where memory migrates along with the threads, so that keeping the
> threads on the same node allows the shared memory to be migrated to
> the node that both the threads are running on.

Right, this is what my question was: when you migrate processes, do you
migrate kernel objects that they share? To answer that, you broke your
answer into 2 parts, one for TCP mbufs, and one for fixed buffers. The
difference here is the nature of the buffer, fixed or dynamic. For dynamic
things, everything will work out eventually. For fixed things, migrating
the object does not make sense, except possibly in extreme circumstances
(at least, that is what I read from your statement "but this seems like
something one tunes"). In which case, the logic that you _could_ migrate
the shared memory along with the migrating threads loses a bit of its
edge. No?

Remember, we are discussing reasons it might make sense to ask the kernel
to put two threads on the same cpu/node. Isn't it a little far-fetched to
ask two threads to be on the same cpu/node, in the hope that they will
allocate shm pages on the same node, *and then be migrated, and the shm
follows them*? You can turn around and say this *is* exactly the case
where you want to migrate the object too ... in that case, I would like to
understand what your heuristics would be for deciding which objects to
migrate.

Don't get me wrong ... I am sure there is something I am missing. All I am
saying is that this logic of associated threads sharing kernel data
internally and everything migrating together just does not seem appealing
right now. Which leaves us with node-local caches, local snoops, etc. to
justify being on the same node/cpu. Which is still a very good reason.

Btw, I have a different problem that needs solving now. How do you expose
multilevel caches, preferably in an architecture-independent way, to the
user so he can better decide how to tie processes to cpus? The *best* way
would be to come up with an IRIX-like hwgfs, but suitably generalized for
other architectures, along with platform-specific information. This is one
of the topics I discussed at yesterday's kernel BOF (not in any detail);
the writeup is at http://oss.sgi.com/projects/numa/download/bof.033001.

Kanoj