To: Paul McKenney/Beaverton/IBM@IBMUS
Cc: andi@suse.de, andrea@suse.de, aono@ed2.com1.fc.nec.co.jp, beckman@turbolabs.com, bjorn_helgaas@hp.com, Hubertus Franke/Watson/IBM@IBMUS, Jerry.Harrow@Compaq.com, jwright@engr.sgi.com, kanoj@engr.sgi.com, Kenneth Rozendal/Austin/IBM@IBMUS, kumon@flab.fujitsu.co.jp, mikek@sequent.com, norton@mclinux.com, suganuma@hpc.bs1.fc.nec.co.jp, sunil.saxena@intel.com, tbutler@hi.com, woodman@missioncriticallinux.com
From: kanoj@google.engr.sgi.com
Date: 04/02/01 03:07 PM
Subject: Re: your mail (NUMA-on-Linux roadmap)

> One question I have is whether software modules that "knew" the topology
> could be simpler (though you clearly need different versions of such
> modules for each different architecture/topology). Another is whether
> a different representation would make life significantly simpler (a
> trivial case where it does is #2, where sorting the lists is helpful,
> though maybe not enough to be worthwhile).

Sure, I would like to hear about alternate representations if people
have ideas.

> > (Oh, maybe I have not mentioned this, but there is currently a way
> > to go from cpu number to node number).
>
> This is very good! Is there the inverse mapping from node number to

Look for cputocnode().

> CPU number (with some sort of error in case we allow nodes that do not
> have CPUs)? One can get this by mapping all the CPUs to their nodes
> and keeping track, of course, but...

No, there is no inverse mapping from node to list-of-cpus, simply because
we haven't needed it thus far in Linux.

> > The rest of the options might not be used as-is ... ie, the query may
> > have to be modified a bit. In any case, distance[i][j] will still
> > solve these, and yes, there may be better ways of doing it rather
> > than using distance[i][j]. Btw, funny that you bring up node exclusion ...
> > because that is what cpusets/memorysets/nodesets do implicitly; in which
> > case, you actually start with a set of nodes from the global set of
> > nodes, and you do the search on those. I guess that would be the
> > delete-before-search approach, rather than the search-and-delete
> > approach that you seem to imply.
>
> I agree that delete-before-search seems simpler and more efficient.
> Are you OK with adding new representations as the need arises? One way
> of making this easier to change (in case someone later comes up with a
> better representation for a given operation) is to use a macro/function
> that hides the exact representation.

Sure.

> 9. Given resources (tasks, memory segments, and perhaps in-kernel
>    data structures) for one app allocated on a given set of CPUs
>    and nodes, and ditto for another app, and given the one or both
>    app wants more resources, what is the best way to reallocate,
>    given the costs of moving things around and the ongoing costs
>    of suboptimal placement?
>
> But perhaps we should punt on this one for the first go-round?

Yes.

> > > > Same as above. for_each_node_at_distance() or distance[i][j] ...
> > >
> > > Again, sounds plausible. What about potential machines that have memory,
> > > CPUs, and/or I/O devices segregated? Should we ignore this possibility?
> >
> > What extra problems do you think this machine opens up (other than possibly
> > a lot more cpu-only and memory-only nodes)? I am not sure I understand what
> > you mean by segregated here ... Always, the trick is in mapping your
> > hardware to the conceptual numa node (cpus/memory/devices). This mapping
> > might not always follow board-level packaging ... obvious enough for
> > nested-numa.
>
> These architectures no longer guarantee that each node will contain at least
> one CPU.
> You might have nodes that contain only CPUs (like nested-NUMA),
> you might have nodes that contain only memory, and you might have nodes
> that contain only I/O devices. The electronics connecting these partial
> nodes might be an arbitrary graph, so that there might be multiple memories
> close to a given I/O device, but not close to each other.

Yes, I might have commented something that led people to believe there
must be at least one cpu ... or some memory ... on all nodes. If so, I
conveyed a wrong message. Remember, we talked about why per node kswapd
might not work on all systems? Simply because a node (memory-node) might
not have any cpus. So, yes, we definitely need to support cpu-only,
memory-only and device-only nodes, at least in design, if not in practice.

> > http://oss.sgi.com/projects/numa/download/bof.033001.
>
> How was this received at the BOF?

Not badly. Linus (and Alan Cox) and a few others seem pretty sympathetic
towards NUMA. Whether that finally translates into code acceptance is
anyone's guess, but at least they are viewing NUMA the same way that
they were viewing SMP a couple of years back ... NUMA's here, so Linux has
to adapt to start running (well) on these machines.

Kanoj
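
Two points from the thread above can be illustrated with a small user-space C sketch: building the missing node-to-CPU inverse map by walking the CPU-to-node mapping and keeping track, and the delete-before-search approach, where the node set is narrowed first and distance[i][j] is searched only over what remains. This is a sketch under stated assumptions, not kernel code: build_node_cpus(), nearest_node(), struct node_cpus, MAX_NODES, MAX_CPUS and NO_NODE are hypothetical names, and the plain cpu_to_node array only stands in for whatever cputocnode() would return on a real system.

```c
#include <assert.h>

#define MAX_NODES 8      /* hypothetical limits, chosen for illustration */
#define MAX_CPUS  16
#define NO_NODE   (-1)   /* returned when the candidate set is empty */

/* Hypothetical inverse map: for each node, the CPUs that live on it. */
struct node_cpus {
    int count[MAX_NODES];
    int cpu[MAX_NODES][MAX_CPUS];
};

/*
 * Build node -> CPUs from a cpu -> node table by visiting every CPU once.
 * Nodes without CPUs (memory-only or device-only nodes) simply end up
 * with count == 0, which callers can treat as the "no CPUs here" case.
 */
void build_node_cpus(const int *cpu_to_node, int ncpus, int nnodes,
                     struct node_cpus *out)
{
    assert(ncpus <= MAX_CPUS && nnodes <= MAX_NODES);
    for (int n = 0; n < MAX_NODES; n++)
        out->count[n] = 0;
    for (int c = 0; c < ncpus; c++) {
        int n = cpu_to_node[c];
        assert(n >= 0 && n < nnodes);
        out->cpu[n][out->count[n]++] = c;
    }
}

/*
 * Delete-before-search: 'candidates' is a bitmask of the nodes still
 * allowed (for example, the nodes of a nodeset). Only those nodes are
 * compared, so excluded nodes never enter the search at all.
 * Returns the allowed node closest to 'from', or NO_NODE if none is allowed.
 */
int nearest_node(int distance[MAX_NODES][MAX_NODES], int from,
                 unsigned candidates, int nnodes)
{
    int best = NO_NODE;
    for (int n = 0; n < nnodes; n++) {
        if (!(candidates & (1u << n)))
            continue;
        if (best == NO_NODE || distance[from][n] < distance[from][best])
            best = n;
    }
    return best;
}
```

Hiding both lookups behind functions like these is in the spirit of the macro/function suggestion in the thread: the underlying representation could later change without touching callers.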