To: Paul McKenney/Beaverton/IBM@IBMUS
cc: andi@suse.de, andrea@suse.de, Hubertus Franke/Watson/IBM@IBMUS,
    kanoj@engr.sgi.com, bjorn_helgaas@hp.com, jwright@engr.sgi.com,
    sunil.saxena@intel.com, Jerry.Harrow@Compaq.com, kumon@flab.fujitsu.co.jp,
    aono@ed2.com1.fc.nec.co.jp, suganuma@hpc.bs1.fc.nec.co.jp,
    woodman@missioncriticallinux.com, norton@mclinux.com,
    beckman@turbolabs.com, tbutler@hi.com
Date: 03/28/01 03:57 PM
From: Kanoj Sarcar
Subject: Re: your mail

> o  Memory-allocation policies (e.g., DISCONTIGMEM, first-touch
>    allocation, node-specific page allocation, node-specific bootmem
>    allocation, read-only mapping replication, per-node kmalloc(),
>    per-node kswapd/kreclaimd).  Status: mostly done in 2.4

Just to be clear: _kernel_ read-only replication has been implemented on
some architectures.  Per-node kmalloc has not been, and volunteers in this
area are needed.  We need the option of architecture code to ask for
per-node kswapd/kreclaimd (do not assume all NUMA machines will want
per-node daemons; some might want a daemon for four nodes, etc.), and this
is 2.5 work.

> o  Hierarchical scheduling policies.  Status: some patches available.
>    Urgency: high.

What is hierarchical scheduling?  Handling multilevel caches, nested NUMA,
etc.?  If so, as I mentioned, some progress has been made in the sense that
Linus acknowledges the issue and is open to a per-arch scheduling solution,
i.e., goodness() will be replaced with arch_goodness() etc.

> o  Multipath I/O.  Status: efforts in progress.  Urgency: low to
>    moderate, depending on architecture.

It wasn't clear to me what the relationship between multipath I/O and NUMA
is.  True, in case a user program wants to allocate memory near a device,
it has more choices if the device is multipath.  Are there other
relationships I am missing?

> o  Support for nested NUMA (e.g., nodes composed of several multi-CPU
>    chips).  Status: Not yet started.  Urgency: moderate, depending
>    on architecture.

Only scheduling optimizations here, right?

> o  Support for non-tree topologies.  There were some comments about
>    this possibility yesterday.  Does anyone wish to stand up and be
>    counted?  It does add some complexity, so do we really want to
>    support it?

Hmm, IRIX's /hw, i.e. hwgfs (the hardware-graph file system), is a graph.
This is because it represents the interconnect between nodes.  I remember
graphs and trees being mentioned while talking about /proc extensions; is
that what you are talking about here?

> Category 2: Facilities to bind tasks and memory to CPUs and nodes
>
> o  Topology discovery (see below for some approaches).
>    Status: Not yet started.  Urgency: high.

I suspect this is completely architecture/platform specific.  The generic
NUMA code will expect a distance matrix distances[N][N], which the platform
code has to populate.  Device-node distance is rarely asked for, so it is
okay to have something like a procedure call, which you talk about later.

> o  Binding tasks to CPUs or groups of CPUs (export runon/cpuset
>    functionality).  Status: some patches available.  Urgency: high.
>
> o  Binding tasks and related groups of tasks to nodes or groups
>    of nodes.  Status: not yet started.  Urgency: high.
>
> o  Binding memory to nodes or groups of nodes.  Status: not yet started.
>    Urgency: high.

I think I covered some reasons why SGI thinks the above are needed.
Basically, soft partitioning (to reduce the impact of one set of programs
on another), and more repeatability in performance.  Do others have other
reasons for these features?

Both repeatability and partitioning can be addressed by nodesets (which was
an advanced option as I mentioned on the slide, but never got around to
talking about), which I _think_ is easier to understand.  Of course,
cpusets/memory-placement gives more detailed control.  Whether NUMA uses
cpusets will also be dictated at least partly by whether cpusets become
part of SMP kernels.
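To make the nodeset idea concrete, here is a rough sketch of what a nodeset
binding could look like.  None of these names (nodemask_t, struct
task_placement, nodeset_bind) exist in 2.4; they are invented purely for
illustration:

/*
 * Purely illustrative sketch, not existing kernel code: a "nodeset" is a
 * bitmask of nodes attached to a task.  It restricts both where the
 * scheduler may run the task and where the allocator should place its
 * memory.  All names here are invented for this sketch.
 */
#include <errno.h>

typedef unsigned long long nodemask_t;	/* one bit per node, <= 64 nodes */

struct task_placement {
	nodemask_t allowed_nodes;	/* soft partition for the task */
};

/* Bind a task's placement to a set of nodes. */
int nodeset_bind(struct task_placement *tp, nodemask_t nodes)
{
	if (nodes == 0)
		return -EINVAL;		/* need at least one node */
	tp->allowed_nodes = nodes;
	return 0;
}

/* May this task run on (or allocate from) node nid? */
int node_allowed(const struct task_placement *tp, int nid)
{
	return (tp->allowed_nodes >> nid) & 1;
}

A cpuset-style interface would look the same, just with one bit per cpu
instead of per node; the question is whether the coarser node granularity
is enough for the partitioning and repeatability cases above.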
> o  General specification of "hints" that indicate relationships
>    between different objects, including tasks, a given task's
>    data/bss/stack/heap ("process" to the proprietary-Unix people
>    among us), a given shm/mmap area, a given device, a given CPU,
>    a given node, and a given IPC conduit.
>    Status: Needs definition, research, prototyping, and
>    experimentation.  Urgency: Hopefully reasonably low, given the
>    complexities that would be involved if not done correctly.

I remember we talked about kernel-to-user copying, but we decided it was a
non-issue, right?  The thread could just ask to be close to the device.

I also remember a thread wanting to be close to an IPC conduit; was the
example that two threads were communicating via pipes, and both should be
near the kernel-maintained buffer for the pipe?  Or was it something else?
I thought we decided this could be handled by asking the kernel to schedule
the threads on the same node, so that (hopefully) the kernel buffer would
be on the same node.  Is there a reason other than the above for two
threads to want to be on the same cpu (I know IBM has the concept of
sharing cache affinity between multiple threads)?  Or to be on the same
node (forget node-level caches, local snoops, etc.)?

> So where do you get the distance information from?  I believe that
> this must be supplied by architecture-specific functions: on some
> systems, the firmware might supply this information, while on others,
> the architecture-specific Linux code will likely have to make
> assumptions based on the system structure.  Would an API like the
> following be appropriate?
>
>	int numa_distance(char *obj1, char *obj2)
>
> where obj1 and obj2 are the names of the CPUs or memories for which
> distance information is desired.

Much simpler to have a matrix populated at boot-up time.  (Let's not worry
right now about hotplug; we can move to a procedure call if hotplug demands
it.)

> All these proposals leave open the question of how to represent
> distance information from memory and CPU to devices.

Chances are, we are not going to use device distances.  If the user wants
to allocate memory near device D1, the kernel will look at the device
handle (what this will end up being is an open issue, i.e., is it an ascii
name, devfs handle, pci_dev *, etc.), figure out the controlling/master
node (with something like master_node(void *device_handle)), and try to
allocate memory on that node (falling back on the next best node based on
inter-node distances).  Similarly for a user wanting to be on a cpu close
to a device.  If it really came to needing device distances, I would
suggest something like device_distance(void *device_handle, char *kvaddr)
to find the distance from a device to some piece of memory.

This raises an interesting thing that I am taking for granted: all ccNUMA
machines use the same bus/network/switches for PIO, DMA and internode
memory transfer paths.  So the PIO distance of a cpu to a device would be
the same as the DMA distance from the containing node to the device, which
is also equal to the memory distance from any cpu on the containing node to
the master node of the device.  Is there an architecture that does not fit
into this simplified picture?
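To spell out how the boot-time matrix and master_node() would fit together:
only master_node() and device_distance() are names suggested above;
everything else in this sketch (numa_distances, node_has_memory,
best_node_for_device) is invented for illustration and is not existing
kernel code:

/*
 * Illustrative only.  The inter-node distance matrix is filled in once by
 * platform code at boot; master_node() is the hook suggested above (its
 * handle type -- ascii name, devfs handle, pci_dev *, ... -- is still an
 * open issue); node_has_memory() stands in for whatever "can I allocate
 * here" check the page allocator would use.
 */
#include <limits.h>

#define MAX_NUMNODES	8

int numa_distances[MAX_NUMNODES][MAX_NUMNODES];	/* populated at boot */
int numa_num_nodes;

int master_node(void *device_handle);		/* platform specific */
int node_has_memory(int nid);			/* allocator specific */

/*
 * Pick the node to allocate from for memory "near" a device: start at
 * the device's master node, fall back by increasing inter-node distance.
 */
int best_node_for_device(void *device_handle)
{
	int home = master_node(device_handle);
	int best = -1, best_dist = INT_MAX;
	int nid;

	for (nid = 0; nid < numa_num_nodes; nid++) {
		if (!node_has_memory(nid))
			continue;
		if (numa_distances[home][nid] < best_dist) {
			best_dist = numa_distances[home][nid];
			best = nid;
		}
	}
	return best;	/* -1 if no node has memory available */
}

The cpu placement case ("run me close to device D1") is the same lookup,
just feeding the resulting node to the scheduler instead of the allocator.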
> Can we simplify this?  In most architectures I have dealt with, you can
> do pretty well just by counting "hops" through nodes or interconnects.

Yes, that was my expectation too.  Platform code is perfectly welcome to
set distance[i][j] = 1.6 for 1 hop, 4.7 for 2 hops, etc., though.  For
most, 1 for 1 hop and 2 for 2 hops would work fine, I suspect.

Kanoj
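P.S.  To make the hop-count suggestion concrete, the platform side is tiny
if hops are good enough.  A sketch, where hops_between() is a stand-in for
whatever the firmware or platform code actually knows (it is not an
existing interface, and neither is numa_init_distances):

/*
 * Illustrative platform boot code filling the generic distance matrix
 * from hop counts.  None of this is existing kernel code.
 */
#define MAX_NUMNODES	8

extern int numa_distances[MAX_NUMNODES][MAX_NUMNODES];
extern int numa_num_nodes;

int hops_between(int from_nid, int to_nid);	/* platform/firmware specific */

void numa_init_distances(void)
{
	int i, j;

	for (i = 0; i < numa_num_nodes; i++)
		for (j = 0; j < numa_num_nodes; j++)
			/* local = 0 hops, 1 for 1 hop, 2 for 2 hops, ... */
			numa_distances[i][j] = hops_between(i, j);
}

A platform that wanted the 1.6/4.7-style weights would just store scaled
integers (e.g., 16 and 47) instead of raw hop counts.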