To: Paul McKenney/Beaverton/IBM@IBMUS
cc: andi@suse.de, andrea@suse.de, aono@ed2.com1.fc.nec.co.jp,
    beckman@turbolabs.com, bjorn_helgaas@hp.com,
    Hubertus Franke/Watson/IBM@IBMUS, Jerry.Harrow@Compaq.com,
    jwright@engr.sgi.com, kanoj@engr.sgi.com, kumon@flab.fujitsu.co.jp,
    norton@mclinux.com, suganuma@hpc.bs1.fc.nec.co.jp,
    sunil.saxena@intel.com, tbutler@hi.com,
    woodman@missioncriticallinux.com, mikek@sequent.com,
    Kenneth Rozendal/Austin/IBM@IBMUS
Date: 03/29/01 09:26 PM
From: Kimio Suganuma
Subject: Re: NUMA-on-Linux roadmap, version 2

Hi all,

I feel this topology descriptor might also be useful for other
purposes: for example, hot-plugging, dynamic partitioning, MCA
handling, and so on. Do you think this /proc information should be
generic or NUMA-specific?

Kimi

> Topology Discovery
>
> o Methods from prior NUMA OSes:
>
>   o SGI's /hardware (sp?).
>
>   o ptx's TMP_* commands to the tmp_affinity() system call.
>
>   o Compaq's "foreach()" library functions (see table 2-2 of
>     the .pdf file).
>
>   o AIX's rs_numrads(), rs_getrad(), rs_getinfo(), and
>     rs_getlatency() library functions.
>
> o Suggestions from the meeting, involving extensions to the /proc
>   filesystem:
>
>   1. /proc/nodeinfo "flat file" with tags similar to /proc/cpuinfo.
>      For example:
>
>          node: 0
>          PCI busses: 0 1
>          PCI cards: ...
>
>      A hierarchical NUMA architecture could be handled in this
>      scheme by using dot-separated node numbers like this:
>
>          node: 1.0
>
>      which would indicate the zeroth node within node 1.
>
>   2. /proc/nodeinfo containing relationship information, one
>      "parent" entity per line. For example:
>
>          toplevel: node0 node1 node2
>          node0: cpu0 cpu1 cpu2 cpu3 mem0 pcibus0 pcibus1
>          pcibus0: dev0 otherdev0
>
>      Again, dot-separated numbers could be used to specify
>      hierarchical NUMA architectures:
>
>          node1: node1.0 node1.1 node1.2 node1.3
>
>   3. /proc/nodeinfo containing relationship information, one
>      relationship per line. This allows better tagging with
>      attributes (for example, memory size, bus speed, "distance"
>      estimates, etc.). It also allows non-tree architectures
>      to be specified easily. For example:
>
>          toplevel: node0
>          toplevel: node1
>          toplevel: node2
>          node0: mem0 512m
>          node0: cpu0 cpu1:1 cpu2:1 cpu3:1 mem0:1 cpu4:2  # object:dist
>          node0: cpu1 cpu0:1 cpu2:1 cpu3:1 mem0:1 cpu4:2  # object:dist
>          node0: cpu2 cpu0:1 cpu1:1 cpu3:1 mem0:1 cpu4:2  # object:dist
>          node0: cpu3 cpu0:1 cpu1:1 cpu2:1 mem0:1 cpu4:2  # object:dist
>          node0: pci0 66mhz
>          pci0: eth0 1gb
>
>      The record for each CPU includes the "distances" to each
>      other CPU and to each memory.
>
>      Hierarchical NUMA architectures would just indicate the
>      relationships:
>
>          node1: node1.0
>
>   4. /proc/node containing a directory structure describing the
>      system architecture. There would be a directory for each
>      node, which could contain subordinate directories for
>      subordinate nodes, as well as a file describing each resource
>      contained in the given node. In this case, we don't need to
>      prefix each node number with "node", since the presence of a
>      directory already identifies it as a node.
>
>      The example above would have /proc/node containing directories
>      for each node, named "0", "1", and so on. These per-node
>      directories could themselves contain directories named
>      "0.0", "0.1", and so on to indicate nested NUMA nodes.
>      The directory /proc/node/0 would contain files cpu0, cpu1,
>      cpu2, and cpu3, and these files would contain distance
>      records to each other cpu and to each memory. For example,
>      /proc/node/0/cpu0 might contain:
>
>          cpu1 1
>          cpu2 1
>          cpu3 1
>          mem0 1
>          cpu4 2
>
>      Non-tree architectures (e.g., the ring example discussed
>      yesterday) could use symbolic links or circular paths as
>      needed. (A userspace parsing sketch for these per-CPU files
>      appears after the numa_distance() proposal below.)
>
> In all these examples, the kernel would need to maintain the topology
> information in some convenient form: a linked structure, for example.
>
> So where do you get the distance information from? I believe that
> this must be supplied by architecture-specific functions: on some
> systems, the firmware might supply this information, while on others,
> the architecture-specific Linux code will likely have to make
> assumptions based on the system structure. Would an API like the
> following be appropriate?
>
>     int numa_distance(char *obj1, char *obj2)
>
> where obj1 and obj2 are the names of the CPUs or memories for which
> distance information is desired.
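> For concreteness, here is a rough sketch of one form that linked
> structure might take, with numa_distance() computing the "distance"
> as a simple hop count over it. This is illustration only: the type
> and variable names are invented for this note, and the fixed-size
> tables are a simplifying assumption, not a proposal.
>
>     /* Sketch only: invented names, not existing kernel code. */
>     #include <string.h>
>
>     #define MAX_OBJS  64  /* assumed limits, for simplicity */
>     #define MAX_LINKS  8
>
>     struct numa_obj {
>             char name[16];                    /* "cpu0", "mem0", "node1.0", ... */
>             int nlinks;
>             struct numa_obj *link[MAX_LINKS]; /* directly connected objects */
>     };
>
>     static struct numa_obj objs[MAX_OBJS];    /* filled in by arch code */
>     static int nobjs;
>
>     static struct numa_obj *numa_lookup(const char *name)
>     {
>             int i;
>
>             for (i = 0; i < nobjs; i++)
>                     if (strcmp(objs[i].name, name) == 0)
>                             return &objs[i];
>             return NULL;
>     }
>
>     /*
>      * Distance as hop count: breadth-first search from obj1 to obj2.
>      * Returns -1 if either object is unknown or the two objects are
>      * not connected.
>      */
>     int numa_distance(char *obj1, char *obj2)
>     {
>             struct numa_obj *queue[MAX_OBJS];
>             int dist[MAX_OBJS];               /* -1 means not yet visited */
>             int head = 0, tail = 0, i;
>             struct numa_obj *from = numa_lookup(obj1);
>             struct numa_obj *to = numa_lookup(obj2);
>
>             if (from == NULL || to == NULL)
>                     return -1;
>             for (i = 0; i < nobjs; i++)
>                     dist[i] = -1;
>             dist[from - objs] = 0;
>             queue[tail++] = from;
>             while (head < tail) {
>                     struct numa_obj *cur = queue[head++];
>
>                     if (cur == to)
>                             return dist[cur - objs];
>                     for (i = 0; i < cur->nlinks; i++) {
>                             struct numa_obj *n = cur->link[i];
>
>                             if (dist[n - objs] < 0) {
>                                     dist[n - objs] = dist[cur - objs] + 1;
>                                     queue[tail++] = n;
>                             }
>                     }
>             }
>             return -1;  /* unreachable */
>     }
>
> If the link[] arrays were populated so that the objects within node0
> are directly connected and cpu4 is reached through an interconnect
> object, numa_distance("cpu0", "mem0") would return 1 and
> numa_distance("cpu0", "cpu4") would return 2, matching the
> object:dist annotations in proposal 3.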
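> On the consumer side, the one-record-per-line files of proposal 4
> would be trivial to parse. A userspace sketch (the path is the
> proposed one from the example above, not anything that exists
> today):
>
>     /* Sketch only: reads a proposal-4 style per-CPU distance file. */
>     #include <stdio.h>
>
>     int main(void)
>     {
>             FILE *fp = fopen("/proc/node/0/cpu0", "r");
>             char name[32];
>             int dist;
>
>             if (fp == NULL) {
>                     perror("/proc/node/0/cpu0");
>                     return 1;
>             }
>             while (fscanf(fp, "%31s %d", name, &dist) == 2)
>                     printf("distance from cpu0 to %s: %d\n", name, dist);
>             fclose(fp);
>             return 0;
>     }
>
> The flat-file formats 1 through 3 would take only slightly more
> effort, since userspace would have to split out the colon-separated
> object:dist fields itself.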
> All these proposals leave open the question of how to represent
> distance information from memory and CPU to devices.
>
> Can we simplify this? In most architectures I have dealt with, you
> can do pretty well just by counting "hops" through nodes or
> interconnects. I have a really hard time believing that any software
> is going to go to the trouble of really handling the n-squared
> distance metrics without also going to the trouble of just measuring
> the bandwidths and latencies.
>
> Thoughts? Other approaches?
>
>                                         Thanx, Paul

-- 
Kimio Suganuma