To: andi@suse.de, andrea@suse.de, Hubertus Franke/Watson/IBM,
    kanoj@engr.sgi.com, bjorn_helgaas@hp.com, jwright@engr.sgi.com,
    sunil.saxena@intel.com, Jerry.Harrow@Compaq.com,
    kumon@flab.fujitsu.co.jp, aono@ed2.com1.fc.nec.co.jp,
    suganuma@hpc.bs1.fc.nec.co.jp, woodman@missioncriticallinux.com,
    norton@mclinux.com, beckman@turbolabs.com, tbutler@hi.com
Date: 03/27/01 11:34 PM
From: Paul McKenney
Subject: Hello!

Thank you all for contributing to yesterday's NUMA-on-Linux get-together!
I believe we made good progress, and I look forward to continuing to work
with you all.

I have received a few presentations.  If you have not already done so,
please send me a softcopy of your presentation so that I can post it on
SourceForge.

At the end of the meeting, I said I would send out a list of the
NUMA-related tasks, grouped into three categories: (1) make Linux's
default handling of NUMA work well, (2) facilities to bind tasks and
memory to CPUs and nodes, and (3) facilities to provide more general
scheduling constraints and hints.  This is -not- a priority list, nor is
it a schedule.  It is instead simply a way of grouping the tasks to help
understand how they relate.  I have tagged each item with a swag at its
current status and urgency.  Although I based these swags on yesterday's
conversations and presentations, I am sure that a number of them could
use some adjustment, and I know you guys will not be shy.  ;-)

Category 1: Make Linux's default handling of NUMA work well

o  Memory-allocation policies (e.g., DISCONTIGMEM, first-touch
   allocation, node-specific page allocation, node-specific bootmem
   allocation, read-only mapping replication, per-node kmalloc(),
   per-node kswapd/kreclaimd); see the sketch following this list for
   one example.  Status: mostly done in 2.4.

o  Hierarchical scheduling policies.  Status: some patches available.
   Urgency: high.

o  NUMAization of internal kernel data structures and algorithms, as
   indicated by benchmarking and workload experiences.  Status: not yet
   started.  Urgency: TBD based on benchmark data.

o  Multipath I/O.  Status: efforts in progress.  Urgency: low to
   moderate, depending on architecture.

o  Migration (has been helpful on some, but not all, architectures).
   Status: not yet started.  Urgency: low to moderate, depending on
   architecture.

o  Support for nested NUMA (e.g., nodes composed of several multi-CPU
   chips).  Status: not yet started.  Urgency: moderate, depending on
   architecture.

o  Support for non-tree topologies.  There were some comments about this
   possibility yesterday.  Does anyone wish to stand up and be counted?
   It does add some complexity, so do we really want to support it?
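To put a little flesh on the node-specific page allocation item, here is
a minimal sketch of allocating and freeing a page on a particular node.
I am using the alloc_pages_node()/page_address()/free_pages() names as
they appear in more recent trees; the 2.4-era DISCONTIGMEM code may spell
some of these differently, so please treat the names as an assumption,
not gospel.

    #include <linux/mm.h>
    #include <linux/gfp.h>

    /*
     * Sketch only: allocate one page from node "nid" and return its
     * kernel virtual address, or NULL on failure.
     */
    static void *grab_page_on_node(int nid)
    {
            struct page *page;

            page = alloc_pages_node(nid, GFP_KERNEL, 0); /* order-0 page on nid */
            if (page == NULL)
                    return NULL;
            return page_address(page);
    }

    static void drop_page(void *addr)
    {
            free_pages((unsigned long)addr, 0);          /* matching order-0 free */
    }

A first-touch default policy would presumably just pass in the node of
the CPU doing the allocating.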
Category 2: Facilities to bind tasks and memory to CPUs and nodes

o  Topology discovery (see below for some approaches).  Status: not yet
   started.  Urgency: high.

o  Binding tasks to CPUs or groups of CPUs (export runon/cpuset
   functionality).  Status: some patches available.  Urgency: high.

o  Binding tasks and related groups of tasks to nodes or groups of nodes.
   Status: not yet started.  Urgency: high.

o  Binding memory to nodes or groups of nodes.  Status: not yet started.
   Urgency: high.

Category 3: Facilities to provide more general scheduling constraints
and hints

o  Facilities that have seen real use:

   o  Specification of a portion of a machine (e.g., Irix's
      mldset_place()).  Status: not yet started.  Urgency: high for
      scientific workloads sharing a machine(?), low for other workloads
      (e.g., commercial workloads have tended to use some form of
      software partitioning to get this effect).

   o  Hints to link tasks and groups of tasks to each other (e.g., ptx's
      attach_proc() or AIX's ra_attach()).  Status: not yet started.
      Urgency: high for commercial workloads, unknown for other
      workloads.

o  Facilities that seem compelling, but which have not yet seen use:

   o  "Virtualization" of CPU and node IDs so that multiple apps can
      each believe that they are scheduling on CPU 0.  Status: not yet
      started.  Urgency: hopefully reasonably low, given the sharply
      differing views on how to go about doing this.  ;-)

   o  General specification of "hints" that indicate relationships
      between different objects, including tasks, a given task's
      data/bss/stack/heap ("process" to the proprietary-Unix people
      among us), a given shm/mmap area, a given device, a given CPU, a
      given node, and a given IPC conduit.  Status: needs definition,
      research, prototyping, and experimentation.  Urgency: hopefully
      reasonably low, given the complexities that would be involved if
      this is not done correctly.

Topology Discovery

o  Methods from prior NUMA OSes:

   o  SGI's /hardware (sp?).

   o  ptx's TMP_* commands to the tmp_affinity() system call.

   o  Compaq's "foreach()" library functions (see table 2-2 of the .pdf
      file).

   o  AIX's rs_numrads(), rs_getrad(), rs_getinfo(), and rs_getlatency()
      library functions.

o  Suggestions from the meeting involving extensions to the /proc
   filesystem:

   1. /proc/nodeinfo "flat file" with tags similar to /proc/cpuinfo.
      For example:

          node:        0
          PCI busses:  0 1
          PCI cards:   ...

      A hierarchical NUMA architecture could be handled in this scheme
      by using dot-separated node numbers like this:

          node:        1.0

      which would indicate the zeroth node within node 1.

   2. /proc/nodeinfo containing relationship information, one "parent"
      entity per line.  For example:

          toplevel: node0 node1 node2
          node0:    cpu0 cpu1 cpu2 cpu3 mem0 pcibus0 pcibus1
          pcibus0:  dev0 otherdev0

      Again, dot-separated numbers could be used to specify hierarchical
      NUMA architectures:

          node1:    node1.0 node1.1 node1.2 node1.3

   3. /proc/nodeinfo containing relationship information, one
      relationship per line.  This allows better tagging with attributes
      (for example, memory size, bus speed, "distance" estimates, etc.).
      It also allows non-tree architectures to be specified easily.
      For example:

          toplevel: node0
          toplevel: node1
          toplevel: node2
          node0: mem0 512m
          node0: cpu0 cpu1:1 cpu2:1 cpu3:1 mem0:1 cpu4:2  # object:dist
          node0: cpu1 cpu0:1 cpu2:1 cpu3:1 mem0:1 cpu4:2  # object:dist
          node0: cpu2 cpu0:1 cpu1:1 cpu3:1 mem0:1 cpu4:2  # object:dist
          node0: cpu3 cpu0:1 cpu1:1 cpu2:1 mem0:1 cpu4:2  # object:dist
          node0: pci0 66mhz
          pci0: eth0 1gb

      The record for each CPU includes the "distances" to each other CPU
      and to each memory.  Hierarchical NUMA architectures would just
      indicate the relationships:

          node1: node1.0

   4. /proc/node containing a directory structure describing the system
      architecture.  There would be a directory for each node, which
      could contain subordinate directories for subordinate nodes, as
      well as a file describing each resource contained in the given
      node.  In this case, we don't need to prefix each node number with
      "node", since the fact that there is a directory determines that
      it must be a node.  The example above would have /proc/node
      containing directories for each node, named "0", "1", and so on.
      These per-node directories could themselves contain directories
      named "0.0", "0.1", and so on to indicate nested NUMA nodes.  The
      directory /proc/node/0 would contain files cpu0, cpu1, cpu2, and
      cpu3, and these files would contain distance records to each other
      CPU and to each memory.  For example, /proc/node/0/cpu0 might
      contain:

          cpu1 1
          cpu2 1
          cpu3 1
          mem0 1
          cpu4 2

      Non-tree architectures (e.g., the ring example discussed
      yesterday) could use symbolic links or circular paths as needed.
      (A toy user-level reader for this format follows the list.)
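Just to convince myself that format 4 is trivial to consume from user
level, here is a toy reader for the proposed per-CPU distance files.
Keep in mind that /proc/node does not exist anywhere today; the path and
the layout are purely the proposal above.

    #include <stdio.h>
    #include <string.h>

    /*
     * Toy reader for the proposed /proc/node/<n>/cpu<m> files, in which
     * each line is "<object> <distance>", e.g., "mem0 1".  Returns the
     * distance to the named object, or -1 if it is absent or the file
     * cannot be opened.
     */
    int read_distance(const char *path, const char *obj)
    {
            FILE *fp = fopen(path, "r");
            char name[64];
            int dist, ret = -1;

            if (fp == NULL)
                    return -1;
            while (fscanf(fp, "%63s %d", name, &dist) == 2) {
                    if (strcmp(name, obj) == 0) {
                            ret = dist;
                            break;
                    }
            }
            fclose(fp);
            return ret;
    }

Given the example file above, read_distance("/proc/node/0/cpu0", "mem0")
would return 1.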
In all these examples, the kernel would need to maintain the topology
information in some convenient form: a linked structure, for example.

So where do you get the distance information from?  I believe that this
must be supplied by architecture-specific functions: on some systems,
the firmware might supply this information, while on others, the
architecture-specific Linux code will likely have to make assumptions
based on the system structure.  Would an API like the following be
appropriate?

    int numa_distance(char *obj1, char *obj2)

where obj1 and obj2 are the names of the CPUs or memories for which
distance information is desired.

All these proposals leave open the question of how to represent distance
information from memory and CPU to devices.  Can we simplify this?  In
most architectures I have dealt with, you can do pretty well just by
counting "hops" through nodes or interconnects.  I have a really hard
time believing that any software is going to go to the trouble of really
handling the n-squared distance metrics without also going to the
trouble of just measuring the bandwidths and latencies.

Thoughts?  Other approaches?

                                        Thanx, Paul
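P.S.  To make the hop-counting idea a bit more concrete, here is a rough
sketch of how numa_distance() might be implemented if the kernel kept
the topology as a simple parent-linked tree.  Everything below (struct
numa_obj, numa_find_obj(), the field names) is made up purely for
illustration; none of it is an existing interface, and a non-tree
topology would need something smarter than a single parent pointer.

    /* Hypothetical topology object: one per CPU, memory, node, etc. */
    struct numa_obj {
            char name[16];              /* e.g., "cpu0", "mem0", "node1.0" */
            struct numa_obj *parent;    /* enclosing node; NULL at top level */
    };

    struct numa_obj *numa_find_obj(char *name);  /* name lookup, not shown */

    static int numa_obj_depth(struct numa_obj *p)
    {
            int d = 0;

            for (; p != NULL; p = p->parent)
                    d++;
            return d;
    }

    /* Distance == number of hops to the closest common ancestor. */
    int numa_distance(char *obj1, char *obj2)
    {
            struct numa_obj *a = numa_find_obj(obj1);
            struct numa_obj *b = numa_find_obj(obj2);
            int da, db, hops = 0;

            if (a == NULL || b == NULL)
                    return -1;          /* no such object */
            da = numa_obj_depth(a);
            db = numa_obj_depth(b);
            while (da > db) { a = a->parent; da--; hops++; }
            while (db > da) { b = b->parent; db--; hops++; }
            while (a != b)  { a = a->parent; b = b->parent; hops += 2; }
            return hops;
    }

For the example /proc/node/0/cpu0 file above, this would give
numa_distance("cpu0", "cpu1") == 2 (up to node0 and back down), so the
absolute numbers differ from the 1s and 2s in that example, but the
relative ordering comes out the same.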