Re: NUMA-on-Linux roadmap, version 2

To: Kanoj Sarcar
cc: Paul McKenney/Beaverton/IBM@IBMUS, andi@suse.de, andrea@suse.de,
    aono@ed2.com1.fc.nec.co.jp, beckman@turbolabs.com,
    bjorn_helgaas@hp.com, Jerry.Harrow@Compaq.com, jwright@engr.sgi.com,
    kanoj@engr.sgi.com, kumon@flab.fujitsu.co.jp, norton@mclinux.com,
    suganuma@hpc.bs1.fc.nec.co.jp, sunil.saxena@intel.com, tbutler@hi.com,
    woodman@missioncriticallinux.com, mikek@sequent.com,
    Kenneth Rozendal/Austin/IBM@IBMUS
Date: 03/30/01 06:44 AM
From: Hubertus Franke/Watson/IBM@IBMUS
Subject: Re: NUMA-on-Linux roadmap, version 2

I agree with you, Kanoj. I already pointed out to Paul that some of the
API, such as cpu-binding and memory binding, applies to large-scale SMP
as well. During the meeting, the need to run particular threads (e.g.
device firmware upgrades) indicated that it is useful for smaller SMPs
too.

My view here has long been to treat large-scale SMP systems as NUMA,
purely to increase locality and to get some of the load distribution
naturally (e.g. represent SMP memory as if it were multiple nodes and
get automatic benefits such as multiple kswapds).

So if we want general kernel acceptance, we can make this case to Linus.
As you pointed out, in this case we need to state the minimal support we
have to provide (e.g. cpu-binding, memory locality, and scheduling
partitioning, whether Andrea's way or the MQ way).

In that light, it might be useful to restate our goal as "SMP and
NUMA-API" and start layering it.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member
Scalability), OS-PIC (Chair)
email: frankeh@us.ibm.com
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

To: Paul McKenney/Beaverton/IBM@IBMUS
cc: andi@suse.de, andrea@suse.de, aono@ed2.com1.fc.nec.co.jp,
    beckman@turbolabs.com, bjorn_helgaas@hp.com,
    Hubertus Franke/Watson/IBM@IBMUS, Jerry.Harrow@Compaq.com,
    jwright@engr.sgi.com, kanoj@engr.sgi.com, kumon@flab.fujitsu.co.jp,
    norton@mclinux.com, suganuma@hpc.bs1.fc.nec.co.jp,
    sunil.saxena@intel.com, tbutler@hi.com,
    woodman@missioncriticallinux.com, mikek@sequent.com,
    Kenneth Rozendal/Austin/IBM@IBMUS
Subject: Re: NUMA-on-Linux roadmap, version 2

>
> o General specification of "hints" that indicate relationships
>   between different objects, including tasks, a given task's
>   data/bss/stack/heap ("process" to the proprietary-Unix people
>   among us), a given shm/mmap area, a given device,
>   a given CPU, a given node, and a given IPC conduit.
>   Status: Needs definition, research, prototyping, and
>   experimentation. Urgency: Hopefully reasonably low,
>   given the complexities that would be involved if not done
>   correctly.
>
>   I believe that this is more a research topic than a set of
>   requirements that one could develop to at present (though there
>   is certainly ample room for practical prototyping of some of the
>   alternatives). I have asked Hubertus Franke to check and see
>   if there are universities or other research institutions that
>   would be interested in working on this. People interested in
>   this area, please let Hubertus know!
>

If it wasn't apparent from the fact that I proposed this, I would like
to mention I am trying to come up with an api based on this idea. I am
trying to make it general enough so that you could also ask for physical
cpu/memory bindings via the api. I will refrain from talking about this
until I have something more concrete, at which point we can discuss
whether it is too complicated etc.

By the way, the IRIX mld* interfaces were in concept designed to do
something like this, but were later abused to be representations of
actual hardware, instead of providing the virtualization that they
should provide.

Instead of focusing on implementation specifics right away, I would
suggest just focusing on the things that we want to do. It seems
accepted that we want to be able to tie a thread to a set of processors,
to allocate memory from certain nodes, etc. _How_ we tell the kernel to
do all this could be via cpuset/memoryset/nodeset/this-new-hunky-api
etc. The choice will depend at least partly on whether some of these
become part of standard Linux. For example, there is good logic to claim
that cpuset/memorysets are needed for non-NUMA too ... in which case,
there is no point implementing nodeset/this-new-hunky-api. I am
expecting Linus to provide some feedback on cpuset/memoryset
acceptability.

Kanoj
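
To make "tie a thread to a set of processors" concrete, here is a minimal sketch of what cpu-binding looks like from user space. It is an illustration only and assumes an interface that is not part of this thread: the CPU affinity calls that later Linux kernels provide, reached here through Python's os.sched_getaffinity and os.sched_setaffinity (Linux only). The helper name pin_to_cpus is hypothetical, and the sketch does not cover memory binding or any of the cpuset/memoryset/nodeset proposals discussed above.

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict `pid` (0 means the caller) to the CPUs in `cpus`.

    Returns the previous affinity set so the caller can restore it.
    `pin_to_cpus` is a hypothetical helper name used for illustration.
    """
    previous = os.sched_getaffinity(pid)
    os.sched_setaffinity(pid, set(cpus))
    return previous


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)   # CPUs we may currently run on
    first = min(allowed)                # pick one CPU we are allowed to use
    old = pin_to_cpus({first})          # bind to that single CPU
    print("now restricted to", sorted(os.sched_getaffinity(0)))
    pin_to_cpus(old)                    # restore the original binding
```

Returning the previous set keeps the binding reversible, which matters for the short-lived cases mentioned earlier in the thread (for example, running one particular task on one particular CPU and then letting it go).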