To: Kanoj Sarcar
cc: Paul McKenney/Beaverton/IBM@IBMUS, andi@suse.de, andrea@suse.de, aono@ed2.com1.fc.nec.co.jp, beckman@turbolabs.com, bjorn_helgaas@hp.com, Jerry.Harrow@Compaq.com, jwright@engr.sgi.com, kanoj@engr.sgi.com, kumon@flab.fujitsu.co.jp, norton@mclinux.com, suganuma@hpc.bs1.fc.nec.co.jp, sunil.saxena@intel.com, tbutler@hi.com, woodman@missioncriticallinux.com, mikek@sequent.com, Kenneth Rozendal/Austin/IBM@IBMUS
Date: 03/30/01 06:44 AM
From: Hubertus Franke/Watson/IBM@IBMUS
Subject: Re: NUMA-on-Linux roadmap, version 2
I agree with you, Kanoj.
I already pointed out to Paul that some of the APIs, such as cpu-binding
and memory binding, apply to large-scale SMP as well.
During the meeting, the need to run particular threads
(e.g. device firmware upgrades) indicated usefulness for smaller SMPs as
well.
My view here has long been to treat large-scale SMP systems as NUMA, purely
to increase locality and to naturally get some of the load distribution
(e.g. represent SMP memory as if it were multiple nodes and get automatic
benefits such as multiple kswapds).
So if we aim for general kernel acceptance, we can make this case to
Linus. As you pointed out, in this case we need to state the minimal support
we must provide (e.g. cpu-binding, memory locality, scheduling partitioning
(e.g. Andrea's way or the MQ way)).
In that light, it might be useful to restate our goal to
"SMP and NUMA-API" and start layering it.
Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: frankeh@us.ibm.com
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003
To: Paul McKenney/Beaverton/IBM@IBMUS
cc: andi@suse.de, andrea@suse.de, aono@ed2.com1.fc.nec.co.jp, beckman@turbolabs.com, bjorn_helgaas@hp.com, Hubertus Franke/Watson/IBM@IBMUS, Jerry.Harrow@Compaq.com, jwright@engr.sgi.com, kanoj@engr.sgi.com, kumon@flab.fujitsu.co.jp, norton@mclinux.com, suganuma@hpc.bs1.fc.nec.co.jp, sunil.saxena@intel.com, tbutler@hi.com, woodman@missioncriticallinux.com, mikek@sequent.com, Kenneth Rozendal/Austin/IBM@IBMUS
Subject: Re: NUMA-on-Linux roadmap, version 2
>
> o General specification of "hints" that indicate relationships
> between different objects, including tasks, a given task's
> data/bss/stack/heap ("process" to the proprietary-Unix people
> among us), a given shm/mmap area, a given device,
> a given CPU, a given node, and a given IPC conduit.
> Status: Needs definition, research, prototyping, and
> experimentation. Urgency: Hopefully reasonably low,
> given the complexities that would be involved if not done
> correctly.
>
> I believe that this is more a research topic than a set of
> requirements that one could develop to at present (though there
> is certainly ample room for practical prototyping of some of the
> alternatives). I have asked Hubertus Franke to check and see
> if there are universities or other research institutions that
> would be interested in working on this. People interested in
> this area, please let Hubertus know!
>
If it wasn't apparent from the fact that I proposed this, I would like
to mention I am trying to come up with an api based on this idea. I am
trying to make it general enough so that you could also ask for physical
cpu/memory bindings via the api. I will refrain from talking about this
until I have something more concrete, at which point we can discuss
whether it is too complicated etc. By the way, the IRIX mld* interfaces
were in concept designed to do something like this, but were later abused
to be representations of actual hardware, instead of providing the
virtualization that they should provide.
Instead of trying to focus in on implementation specifics right away,
I would suggest just focusing on the things that we want to do. It seems
accepted that we want to be able to tie a thread to a set of processors,
be able to allocate memory from certain nodes, etc. _How_ to tell the
kernel to do all this could be via cpuset/memoryset/nodeset/this-new-hunky-api,
etc. Which one will at least partly depend on whether some of these will become
part of standard Linux. For example, there is good logic to claim that
cpuset/memorysets are needed for non-NUMA too ... in which case, no point
implementing nodeset/this-new-hunky-api. I am expecting Linus to provide
some feedback on cpuset/memoryset acceptability.
Kanoj
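
As an illustration of the "tie a thread to a set of processors" goal discussed above, here is a minimal sketch using the CPU-affinity call that present-day Linux exposes through Python's os module. It is not one of the interfaces proposed in this thread (cpuset, memoryset, nodeset), and the helper name bind_to_cpus is hypothetical. It covers cpu-binding only, not memory placement, and assumes a Linux host.

```python
import os


def bind_to_cpus(cpus, pid=0):
    """Restrict a task to a set of CPUs and return the resulting CPU set.

    pid=0 means the calling thread. The helper name is an illustrative
    assumption; only os.sched_getaffinity/os.sched_setaffinity are real
    calls (Linux only).
    """
    # CPUs the task is currently allowed to run on.
    allowed = os.sched_getaffinity(pid)

    # Keep only the requested CPUs that are actually available to us.
    wanted = set(cpus) & allowed
    if not wanted:
        raise ValueError("none of the requested CPUs are available")

    # Ask the kernel to run the task only on the chosen CPUs.
    os.sched_setaffinity(pid, wanted)
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    original = os.sched_getaffinity(0)
    # Pin the calling thread to the lowest-numbered CPU it may use.
    print(bind_to_cpus({min(original)}))
    # Restore the original affinity afterwards.
    os.sched_setaffinity(0, original)
```

Nothing in this sketch depends on the machine being NUMA, which matches the point made in both messages that cpu-binding is just as meaningful on a plain SMP system.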