To: Kanoj Sarcar
cc: andi@suse.de, andrea@suse.de, aono@ed2.com1.fc.nec.co.jp, beckman@turbolabs.com, bjorn_helgaas@hp.com, Hubertus Franke/Watson/IBM@IBMUS, Jerry.Harrow@Compaq.com, jwright@engr.sgi.com, kanoj@engr.sgi.com, Kenneth Rozendal/Austin/IBM@IBMUS, kumon@flab.fujitsu.co.jp, mikek@sequent.com, norton@mclinux.com, suganuma@hpc.bs1.fc.nec.co.jp, sunil.saxena@intel.com, tbutler@hi.com, woodman@missioncriticallinux.com
Date: 04/01/01 10:34 PM
From: Paul McKenney/Beaverton/IBM@IBMUS
Subject: Re: your mail (NUMA-on-Linux roadmap)
> > > > There is some possibility of hierarchical NUMA architectures where
> > > > some memories have lower latency to each other than to other memories
> > > > (for example, consider a conventional NUMA system with a tree-shaped
> > > > interconnect). If we decide to support this sort of architecture,
> > > > there would be some incentive to modify the memory allocation so
> > > > that if there is no memory on the desired node, then the other nodes
> > > > are searched in increasing order of distance from the requesting task.
> > >
> > > Exactly. That is what distance[i][j] will be used for. Of course, you
> > > can argue that instead we could have something like
> > >
> > > for (i = 1; i < max_radius; i++)
> > >     for_each_node_at_distance(targetnode, i)
> > >         ...
> > >
> > > where we pass in a target best node, and we iterate over longer and
> > > longer distances, identifying nodes, and trying to allocate a page,
> > > stopping when we get something.
> > >
> > > I am assuming a construct like for_each_node_at_distance() or
> > > distance[i][j] will _completely_ relieve generic code from looking
> > > at specific topologies. Anyone sees why not?
> >
> > Answering this will take some thought. Here are the operations that I
> > believe will be needed (I am using "diameter" in the graph-theoretic
> > sense here):
> >
> > 1. Find the distance between a pair of nodes.
> > 2. Given a node, find the closest other node.
> > 3. Given a node, find the closest other node that has some resource
> > or attribute (e.g., has more than a specified amount of memory
> > available).
> > 4. Find the largest clique of nodes of diameter less than some
> > specified distance (but maybe you have a better way of satisfying
> > the constraints of needing more than some number of CPUs "close"
> > to each other).
> > 5. Ditto, but some nodes are excluded from consideration
> > (e.g., because they are being used by other applications,
> > don't have enough memory, etc.).
> > 6. Find the set of nodes of the specified size that has the
> > smallest diameter.
> > 7. Ditto, but some nodes are excluded from consideration.
> > 8. Given a set of nodes, find the additional node that results
> > in the smallest diameter of the resulting augmented set.
> > This would be needed when expanding resources available to
> > an application.
>
> Note that distance[i][j] will trivially solve 1 and 2. True, you can
> optimize how you do 2 by having an ordered list of nodes for each node,
> but we can get to that once we have a rough implementation ... if it
> becomes obvious that operation 2 is frequent and needs to be optimized.
> To do 3, you need something like does_node_have_resource(i), and the
> most common use of this can be handled via a pernode mem_avail[] array
> (possibly).
I agree that having all distance[i][j] values would give you the full
set of information, at least in a theoretical sense (though, as we discussed
earlier, it would be good to wrap it in a macro to allow for possible
online insertion/deletion of nodes).
One question I have is whether software modules that "knew" the topology
could be simpler (though you clearly need different versions of such
modules for each different architecture/topology). Another is whether
a different representation would make life significantly simpler (a
trivial case where it does is #2, where sorting the lists is helpful,
though maybe not enough to be worthwhile).
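Just to make my thinking concrete, here is a rough sketch of what I mean by
wrapping the representation (everything here -- __node_distance[][],
numa_max_node, MAX_NUMNODES, and the function names -- is invented for
illustration, not a proposed interface):

    extern int __node_distance[MAX_NUMNODES][MAX_NUMNODES];
    extern int numa_max_node;

    /*
     * Hide the representation so that it can later be replaced (for
     * example, to cope with online insertion/deletion of nodes)
     * without touching any callers.
     */
    #define numa_node_distance(i, j) (__node_distance[(i)][(j)])

    /*
     * Operation 2: find the closest other node, brute force for now.
     * A per-node list sorted by distance could replace this later if
     * it turns out to be a hot path.
     */
    static int numa_closest_node(int node)
    {
            int i, best = -1;

            for (i = 0; i < numa_max_node; i++) {
                    if (i == node)
                            continue;
                    if (best < 0 ||
                        numa_node_distance(node, i) <
                        numa_node_distance(node, best))
                            best = i;
            }
            return best;
    }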
> (Oh, maybe I have not mentioned this, but there is currently a way
> to go from cpu number to node number).
This is very good! Is there an inverse mapping from node number to
CPU numbers (with some sort of error in case we allow nodes that do not
have CPUs)? One can get this by mapping all the CPUs to their nodes
and keeping track, of course, but...
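(For what it is worth, the inverse map could be built once at boot from the
cpu-to-node mapping you mention. A sketch, with node_to_cpumask[] and the
-1 "no CPUs on this node" convention invented for illustration, and
cpu_to_node() standing in for whatever the real mapping is called:

    static unsigned long node_to_cpumask[MAX_NUMNODES];

    static void __init build_node_to_cpu_map(void)
    {
            int cpu;

            for (cpu = 0; cpu < smp_num_cpus; cpu++)
                    node_to_cpumask[cpu_to_node(cpu)] |= 1UL << cpu;
    }

    /* Return the first CPU on the node, or -1 if the node has no CPUs. */
    static int node_to_first_cpu(int node)
    {
            if (node_to_cpumask[node] == 0)
                    return -1;
            return ffs(node_to_cpumask[node]) - 1;
    }

But a real interface provided by the platform-dependent code would of course
be cleaner than everyone rolling their own.)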
> The rest of the options might not be used as-is ... ie, the query may
> have to be modified a bit. In any case, distance[i][j] will still
> solve these, and yes, there may be better ways of doing it rather
> than using distance[i][j]. Btw, funny that you bring up node exclusion ...
> because that is what cpusets/memorysets/nodesets do implicitly; in which
> case, you actually start with a set of nodes from the global set of
> nodes, and you do the search on those. I guess that would be the
> delete-before-search approach, rather than the search-and-delete
> approach that you seem to imply.
I agree that delete-before-search seems simpler and more efficient.
Are you OK with adding new representations as the need arises? One way
of making this easier to change (in case someone later comes up with a
better representation for a given operation) is to use a macro/function
that hides the exact representation.
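To illustrate what I mean by delete-before-search behind such a
macro/function (again, the nodeset representation, node_mem_avail(), and
all the names are invented):

    /*
     * Pick the closest allowed node to "home" that still has at least
     * "pages" pages available.  The allowed-node bitmask plays the role
     * of the cpuset/memoryset: excluded nodes never enter the search.
     */
    static int closest_allowed_node(int home, unsigned long allowed,
                                    unsigned long pages)
    {
            int i, best = -1;

            for (i = 0; i < numa_max_node; i++) {
                    if (!(allowed & (1UL << i)))
                            continue;       /* delete-before-search */
                    if (node_mem_avail(i) < pages)
                            continue;       /* per-node mem_avail[] */
                    if (best < 0 ||
                        numa_node_distance(home, i) <
                        numa_node_distance(home, best))
                            best = i;
            }
            return best;
    }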
> And just one last thing ... in your options, "close" could be replaced
> by "a specific topology, like cube, or sphere" ...
And should be, IMHO. Good point! In this case, the topology serves as
a constraint on the "closeness".
> Given how quickly we have honed in on trying to parse and apply
> selector functions on the topology, I think the best thing for the
> generic code to do is assume that all such interfaces will be in
> topology.h and provided by the platform dependent code.
One other operation occurred to me:
9. Given resources (tasks, memory segments, and perhaps in-kernel
data structures) for one app allocated on a given set of CPUs
and nodes, and ditto for another app, and given that one or both
apps want more resources, what is the best way to reallocate,
given the costs of moving things around and the ongoing costs
of suboptimal placement?
But perhaps we should punt on this one for the first go-round?
> > > Same as above. for_each_node_at_distance() or distance[i][j] ...
> >
> > Again, sounds plausible. What about potential machines that have memory,
> > CPUs, and/or I/O devices segregated? Should we ignore this possibility?
>
> What extra problems do you think this machine opens up (other than possibly
> a lot more cpu-only and memory-only nodes)? I am not sure I understand what
> you mean by segregated here ... Always, the trick is in mapping your
> hardware to the conceptual numa node (cpus/memory/devices). This mapping
> might not always follow board-level packaging ... obvious enough for
> nested-numa.
These architectures no longer guarantee that each node will contain at least
one CPU. You might have nodes that contain only CPUs (like nested-NUMA),
you might have nodes that contain only memory, and you might have nodes
that contain only I/O devices. The electronics connecting these partial
nodes might be an arbitrary graph, so that there might be multiple memories
close to a given I/O device, but not close to each other.
It seems that this would make the placement, migration, and other scheduling
decisions more complex, but I have not really thought it through. It would
certainly create more possibilities that the design would have to consider,
and more test cases, which is why I would be inclined to ignore such
architectures unless someone is willing to speak for them.
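(If we did want to handle them, one simple way to describe partial nodes
to generic code might be a per-node flags word -- the flag names below are
invented for illustration:

    #define NODE_HAS_CPU     0x1
    #define NODE_HAS_MEMORY  0x2
    #define NODE_HAS_IO      0x4

    extern int node_flags[MAX_NUMNODES];

The memory allocator would then skip nodes lacking NODE_HAS_MEMORY and the
scheduler would skip nodes lacking NODE_HAS_CPU, while distance[i][j] would
still cover every node in the graph. But again, I would rather not sign up
for this in the first round.)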
> > > > > I also remember thread wanting to be close to IPC conduit, was the
> > > > > example that two threads were communicating via pipes, and both
> > > > > should be near the kernel maintained buffer for the pipe? Or was it
> > > > > something else? I thought we decided this could be handled by asking
> > > > > the kernel to schedule the threads on the same node, so that
> > > > > (hopefully) the kernel buffer would be on the same node.
> > > >
> > > > Although this does work, it has the undesirable effect of preventing
> > > > migration (on systems that benefit from migration). By specifying that
> > > > a pair of tasks should be "linked", the kernel is free to migrate tasks
> > > > as needed to handle overload situations, while still keeping the pair
> > > > of tasks close to each other.
> > >
> > > Could you explain how this _should_ work? AFAICS, the kernel needs to
> > > be told that 2 threads need to be near each other. The kernel does put
> > > them near each other. The ipc kernel buffers get allocated near both of
> > > them. Then the kernel needs to migrate some processes. Since the threads
> > > have been "linked" before, they are migrated and relocated somewhere
> > > else, near each other again. What happens to the ipc kernel buffer? Who
> > > relocates that? Internally, does the kernel also try to migrate the
> > > objects that each of the relocated processes is using?
> >
> > In ptx, there are two cases: 1) there is a fixed per-connection IPC
> > buffer, which will be small enough to reside comfortably in the per-node
> > cache, and 2) there is no fixed buffer (think TCP), in which case, the
> > STREAMS buffers/mbufs will be allocated "close" to the two threads.
> > In the second case, there may be a short interval during which the
> > STREAMS buffers/mbufs are in the wrong place, but once these are
> > consumed, all is well. One could of course imagine migrating a fixed
> > IPC buffer, but this seems like something one tunes if workloads
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > demonstrate that it is valuable.
> >
> > > > > Is there a reason other than the above for two threads to be wanting
> > > > > to be on the same cpu (I know IBM has the concept of sharing cache
> > > > > affinity between multiple threads)? Or being on the same node (forget
> > > > > node level caches, local snoops etc)?
> > > >
> > > > Those of us working on machines with node-level caches and/or local
> > > > snoops are not likely to forget them -- performance measurements will
> > > > remind us of them if we try to do so. ;-)
> > >
> > > I was saying "forget" in the sense that I already know about this, what
> > > other reasons are there?
> >
> > The only other reason I can come up with off the top of my head is the
> > case where memory migrates along with the threads, so that keeping the
> > threads on the same node allows the shared memory to be migrated to
> > the node that both the threads are running on.
>
> Right, this was what my question was: when you migrate processes, do
> you migrate kernel objects that they share? To answer that, you broke
> your answer into 2 parts, one for TCP mbufs, and one for fixed buffers.
> The difference here is the nature of the buffer, fixed or dynamic.
Yep, exactly!
> For
> dynamic things, everything will work out eventually. For fixed things,
> migrating the object does not make sense, except possibly in extreme
> circumstances (at least, that is what I read from your statement
> "but this seems like something one tunes").
If the code to do the migration is not too complex, there would be some
motivation to migrate in-kernel data structures if the expected number
of accesses to them is large compared to the size of the object, at least
in cases where all tasks referencing the in-kernel object fit into a single
node.
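In other words, something along the lines of the following
back-of-the-envelope test, with the names, CACHE_LINE_SIZE, and the factor
of two being made-up tunables rather than anything I would defend:

    /*
     * Sketch of the "is it worth migrating this object?" heuristic:
     * migrate only if the expected remote references would cost more
     * than copying the object.
     */
    static int worth_migrating(unsigned long object_size,
                               unsigned long expected_refs)
    {
            /* Cost of the copy itself, in cache-line transfers. */
            unsigned long copy_cost = object_size / CACHE_LINE_SIZE;

            /*
             * Each remote reference is assumed to cost roughly one
             * remote cache-line transfer more than a local reference,
             * and the factor of two adds a margin against thrashing.
             */
            return expected_refs > 2 * copy_cost;
    }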
> In which case, the logic
> that you _could_ migrate the shared memory along with the migrating
> threads loses a bit of its edge. No? Remember, we are discussing
> reasons it might make sense to ask the kernel to put two threads on
> the same cpu/node. Isn't it a little far fetched to ask two threads
> to be on the same cpu/node, in the hope that they will allocate
> shm pages on the same node, *and then be migrated, and the shm follows
> them*?
In ptx, we didn't migrate shm, because it normally was striped across
all nodes that you would choose to migrate the pairs of threads to:
we used workload-management features to enforce this.
Of course, one could slowly page out the shm, and then rely on the threads
to page it back in to a more optimal location, but I am not sure whether
this is really worth it.
> You can turn around and say this *is* exactly the case you want to
> migrate the object too ... in that case, I would like to understand
> what your heuristics would be to decide which objects to migrate.
>
> Don't get me wrong ... I am sure there is something I am missing.
> All I am saying is that this logic of associated threads sharing
> kernel data internally and everything migrating together just does
> not seem appealing right now. Which leaves us with node local caches,
> local snoops etc to justify being on the same node/cpu. Which is still
> a very good reason.
In ptx, we migrate a process at a time, so once we decide to migrate a
process, all the kernel data structures that are easy to move and are
worth moving (based on the above heuristic on # references vs. size)
are unconditionally moved. But we only migrate after a considerable
time spent in "remote execution", where the threads are executing on
a different node than their memory resides on. We have some hysteresis
to keep things from thrashing, and we don't migrate lightly.
This worked quite well for us, but of course mileage might vary on
other architectures running other workloads.
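Very roughly, the trigger looks something like the following, though the
real thresholds and bookkeeping in ptx are of course more involved (the
names and the ten-second number below are invented for illustration):

    /*
     * Sketch of a migration trigger with hysteresis: a process is only
     * migrated after a sustained stretch of execution remote from its
     * memory, and the counter resets whenever it runs at home.
     */
    struct numa_sched_state {
            int home_node;                  /* where the memory lives   */
            unsigned long remote_ticks;     /* consecutive remote ticks */
    };

    #define MIGRATE_THRESHOLD_TICKS (10 * HZ)   /* roughly 10 seconds */

    static int should_migrate(struct numa_sched_state *s, int running_node)
    {
            if (running_node == s->home_node) {
                    s->remote_ticks = 0;        /* reset the hysteresis */
                    return 0;
            }
            if (++s->remote_ticks < MIGRATE_THRESHOLD_TICKS)
                    return 0;                   /* not remote long enough */
            s->remote_ticks = 0;
            return 1;                           /* worth migrating now */
    }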
> Btw, I have a different problem that needs solving now. How do
> you expose multilevel caches, preferably in an architecture independent
> way, to the user so he can decide better how to tie processes and cpus?
> The *best* way would be to come up with a IRIX-like hwgfs, but suitably
> generalized for other architectures, along with platform specific
> information. This is one of the topics I discussed at yesterday's
> kernel BOF (not in any detail), writeup is at
>
> http://oss.sgi.com/projects/numa/download/bof.033001.
How was this received at the BOF?
There was some discussion of how best to define the topology (what CPUs
and memories are connected to each other) last Monday. I don't have much
in the way of preferences as long as applications can easily access/parse
the information. And it seems reasonable to me to augment the CPU, memory,
and node information with the information on the caches. What are people's
thoughts?
Kanoj, is there a pointer to public info on hwgfs that would give people a
rough idea of what you have in mind? (I will search www.sgi.com when I
get back to a higher bandwidth connection, but thought you might have a
good pointer off the top of your head.)
Thanx, Paul