To: Kanoj Sarcar
cc: andi@suse.de, andrea@suse.de, aono@ed2.com1.fc.nec.co.jp,
    beckman@turbolabs.com, bjorn_helgaas@hp.com,
    Hubertus Franke/Watson/IBM@IBMUS, Jerry.Harrow@Compaq.com,
    jwright@engr.sgi.com, kanoj@engr.sgi.com,
    Kenneth Rozendal/Austin/IBM@IBMUS, kumon@flab.fujitsu.co.jp,
    mikek@sequent.com, norton@mclinux.com,
    suganuma@hpc.bs1.fc.nec.co.jp, sunil.saxena@intel.com,
    tbutler@hi.com, woodman@missioncriticallinux.com
Date: 03/29/01 08:08 PM
From: Paul McKenney/Beaverton/IBM@IBMUS
Subject: Re: NUMA-on-Linux roadmap, version 2

> > o General specification of "hints" that indicate relationships
> >   between different objects, including tasks, a given task's
> >   data/bss/stack/heap ("process" to the proprietary-Unix people
> >   among us), a given shm/mmap area, a given device,
> >   a given CPU, a given node, and a given IPC conduit.
> >   Status: Needs definition, research, prototyping, and
> >   experimentation.  Urgency: Hopefully reasonably low,
> >   given the complexities that would be involved if not done
> >   correctly.
> >
> > I believe that this is more a research topic than a set of
> > requirements that one could develop to at present (though there
> > is certainly ample room for practical prototyping of some of the
> > alternatives).  I have asked Hubertus Franke to check and see
> > if there are universities or other research institutions that
> > would be interested in working on this.  People interested in
> > this area, please let Hubertus know!
>
> If it wasn't apparent from the fact that I proposed this, I would like
> to mention I am trying to come up with an api based on this idea.  I am
> trying to make it general enough so that you could also ask for physical
> cpu/memory bindings via the api.  I will refrain from talking about this
> until I have something more concrete, at which point we can discuss
> whether it is too complicated etc.

This was not at all apparent to me -- you are referring to the first
bullet on slide 10 of the slideset you presented?
Please accept my apologies for not putting two and two together here.

However, I believe that things will work out as you would like.  For one
thing, it will take -at- -least- a month for any kind of formal research
project to get underway, which allows you time to get your ideas set down
so we can discuss them.  Given the nature of scheduling problems, there
will almost certainly be ample room for research into heuristics that
work best for various workloads and machine architectures, so the
research angle will not be wasted in any case.  I say this because the
kind of multiple-resource scheduling problems that are required are
NP-complete even assuming perfect foreknowledge, and, lacking
foreknowledge, good heuristics are the best one can possibly hope for.
It is really hard to come up with good heuristics.  Most likely, the
research will find a number of reasonable ones that the Linux community
can choose from, build on, learn from, or ignore totally if they all
miss the mark.

> By the way, the IRIX mld* interfaces
> were in concept designed to do something like this, but were later abused
> to be representations of actual hardware, instead of providing the
> virtualization that they should provide.

This is -precisely- the outcome that I am trying to avoid in
NUMA-on-Linux.  If there are (perhaps even 2.5-only throwaway) interfaces
to do the simple binding operations, then there will be less pressure to
abuse the implementation of the more general API.  And I agree that this
sort of abuse happens a lot; I have seen it over and over again in many
different types of projects.  Most NUMA implementations require changes
to many parts of the OS, so the desire to "keep it simple" can come from
many directions.

> Instead of trying to focus in on implementation specifics right away,
> I would suggest just focusing on the things that we want to do.
> It seems
> accepted that we want to be able to tie a thread to a set of processors,
> be able to allocate memory from certain nodes etc.  _How_ to tell the
> kernel to do all this is either via
> cpuset/memoryset/nodeset/this-new-hunky-api etc.  This will at least
> partly depend on whether some of these will become part of standard
> Linux.  For example, there is good logic to claim that cpuset/memorysets
> are needed for non-NUMA too ... in which case, no point implementing
> nodeset/this-new-hunky-api.  I am expecting Linus to provide some
> feedback on cpuset/memoryset acceptability.

Having cpuset/memorysets for SMP as well as NUMA would certainly help out
the people who are starting to run benchmarks on Linux!  I believe that
we all look forward to hearing what Linus has to say on this.

					Thanx, Paul
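[Editor's note: the "tie a thread to a set of processors" operation debated
above did eventually land in standard Linux (sched_setaffinity in 2.5.x,
cpusets later in 2.6).  As a minimal sketch of the concept only -- this
interface postdates the email and is not the API under discussion in the
thread -- the affinity call can be exercised from Python's standard
library, which works on Linux only:]

```python
import os

# Sketch: bind the calling process to a "cpuset" of one processor,
# verify the kernel now reports the restricted set, then restore.
# os.sched_setaffinity() is a Linux-only stdlib wrapper around the
# sched_setaffinity(2) system call; pid 0 means "the calling process".

pid = 0
original = os.sched_getaffinity(pid)   # e.g. {0, 1, 2, 3} on a 4-CPU box

os.sched_setaffinity(pid, {0})         # bind to CPU 0 only
assert os.sched_getaffinity(pid) == {0}

os.sched_setaffinity(pid, original)    # restore the original binding
print("bound and restored:", sorted(original))
```

[The memory half of the request -- allocating from certain nodes -- has no
stdlib equivalent; on modern Linux it is served by the set_mempolicy(2)
and mbind(2) system calls, typically via the libnuma library.]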