To: Paul McKenney/Beaverton/IBM@IBMUS, tony.luck@intel.com,
    kouchi@hpc.bs1.fc.nec.co.jp, asit.k.mallick@intel.com,
    jenna.s.hall@intel.com
Cc: andi@suse.de, andrea@suse.de, aono@ed2.com1.fc.nec.co.jp,
    beckman@turbolabs.com, bjorn_helgaas@hp.com,
    Hubertus Franke/Watson/IBM@IBMUS, Jerry.Harrow@Compaq.com,
    jwright@engr.sgi.com, kanoj@engr.sgi.com,
    Kenneth Rozendal/Austin/IBM@IBMUS, kumon@flab.fujitsu.co.jp,
    mikek@sequent.com, norton@mclinux.com,
    suganuma@hpc.bs1.fc.nec.co.jp, sunil.saxena@intel.com,
    tbutler@hi.com, woodman@missioncriticallinux.com
From: kanoj@google.engr.sgi.com
Date: 04/05/01 01:38 AM
Subject: Re: NUMA-on-Linux roadmap, version 2

> o Suggestions from meeting involving extensions to /proc filesystem:
>
>   1. /proc/nodeinfo "flat file" with tags similar to /proc/cpuinfo.
>      For example:
>
>          node: 0
>          PCI busses: 0 1
>          PCI cards: ...
>
>      A hierarchical NUMA architecture could be handled in this
>      scheme by using dot-separated node numbers like this:
>
>          node: 1.0
>
>      which would indicate the zeroth node within node 1.
>
>   2. /proc/nodeinfo containing relationship information, one
>      "parent" entity per line. For example:
>
>          toplevel: node0 node1 node2
>          node0: cpu0 cpu1 cpu2 cpu3 mem0 pcibus0 pcibus1
>          pcibus0: dev0 otherdev0
>
>      Again, dot-separated numbers could be used to specify
>      hierarchical NUMA architectures:
>
>          node1: node1.0 node1.1 node1.2 node1.3
>
>   3. /proc/nodeinfo containing relationship information, one
>      relationship per line. This allows better tagging with
>      attributes (for example, memory size, bus speed, "distance"
>      estimates, etc.). It also allows non-tree architectures
>      to be specified easily. For example:
>
>          toplevel: node0
>          toplevel: node1
>          toplevel: node2
>          node0: mem0 512m
>          node0: cpu0 cpu1:1 cpu2:1 cpu3:1 mem0:1 cpu4:2  # object:dist
>          node0: cpu1 cpu0:1 cpu2:1 cpu3:1 mem0:1 cpu4:2  # object:dist
>          node0: cpu2 cpu0:1 cpu1:1 cpu3:1 mem0:1 cpu4:2  # object:dist
>          node0: cpu3 cpu0:1 cpu1:1 cpu2:1 mem0:1 cpu4:2  # object:dist
>          node0: pci0 66mhz
>          pci0: eth0 1gb
>
>      The record for each CPU includes the "distances" to each
>      other CPU and to each memory.
>
>      Hierarchical NUMA architectures would just indicate
>      the relationships:
>
>          node1: node1.0
>
>   4. /proc/node containing a directory structure describing the
>      system architecture. There would be a directory for each
>      node, which could contain subordinate directories for
>      subordinate nodes, as well as a file describing each resource
>      contained in the given node. In this case, we don't need
>      to prefix each node number with "node", since the fact that
>      there is a directory determines that it must be a node.
>
>      The example above would have /proc/node containing directories
>      for each node, named "0", "1", and so on. These per-node
>      directories could themselves contain directories named
>      "0.0", "0.1", and so on to indicate nested NUMA nodes.
>      The directory /proc/node/0 would contain files cpu0, cpu1,
>      cpu2, and cpu3, and these files would contain distance
>      records to each other cpu and to each memory. For example,
>      /proc/node/0/cpu0 might contain:
>
>          cpu1 1
>          cpu2 1
>          cpu3 1
>          mem0 1
>          cpu4 2
>
>      Non-tree architectures (e.g., the ring example discussed
>      yesterday) could use symbolic links or circular paths as
>      needed.

Okay, I decided to prototype the last option. The reasons: it gives the
user a way to reference chipset registers, it is the most extensible, and
it handles shared resources (caches) the best.
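Just to make that option concrete before getting into my prototype, here
is a rough user-space sketch (untested; the /proc/node/0/cpu0 path and the
"object distance" record format are taken straight from the example quoted
above, not from anything my patch actually creates) of reading one of
those per-cpu distance files:

#include <stdio.h>

int main(int argc, char **argv)
{
	/* Path from the quoted example; pass a different one on the
	 * command line to read another record file. */
	const char *path = (argc > 1) ? argv[1] : "/proc/node/0/cpu0";
	char name[64];
	int dist;
	FILE *fp = fopen(path, "r");

	if (!fp) {
		perror(path);
		return 1;
	}
	/* Each record is "<object> <distance>", e.g. "cpu1 1" or "mem0 1". */
	while (fscanf(fp, "%63s %d", name, &dist) == 2)
		printf("%s is at distance %d\n", name, dist);
	fclose(fp);
	return 0;
}

The nice property of the directory scheme is that a tool like this needs
almost no parsing; the hierarchy lives in the filesystem and each leaf
file stays trivially simple.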
The naming I finally came up with is a bit different from what the option
suggests (mainly because I implemented first, then looked at what option I
had implemented!), but I wouldn't have any issues changing it to match the
option exactly.

Patch is at

	http://oss.sgi.com/projects/numa/download/naming242.patch

Most importantly, browse the topology.h file, then look at the
implementation in topology.c if you have the patience.

Btw, the implementation currently uses /proc, but new uses of /proc are
strongly discouraged; devfs is probably a better idea, and a new fs is
also possible. In any case, naming implementations are going to change a
lot in 2.5, so discussions about the underlying implementation are
probably not that worthwhile right now. This also means that the kernel
interfaces are open to change (e.g., I don't like topo_shared_cache_add():
it assumes exactly two elements are passed in, but what if more components
share a resource? Also, the implementation should probably use links, but
procfs is lacking in this respect ... alternate link-like solutions are
possible). Feel free to make suggestions.

I tested this out on a mips64 platform, but really, you can create the
graph on any platform, just to see how it looks. Test output:

[guest@trillium]$ cd /proc
[guest@trillium]$ find machine -name "*" -print
machine
machine/node0003
machine/node0003/cpu0007
machine/node0003/cpu0007/shcach0003
machine/node0003/cpu0006
machine/node0003/cpu0006/shcach0003
machine/node0003/memory0003
machine/node0003/memory0003/memsize
machine/node0002
machine/node0002/cpu0005
machine/node0002/cpu0005/shcach0002
machine/node0002/cpu0004
machine/node0002/cpu0004/shcach0002
machine/node0002/memory0002
machine/node0002/memory0002/memsize
machine/node0001
machine/node0001/cpu0003
machine/node0001/cpu0003/shcach0001
machine/node0001/cpu0002
machine/node0001/cpu0002/shcach0001
machine/node0001/memory0001
machine/node0001/memory0001/memsize
machine/node0000
machine/node0000/cpu0001
machine/node0000/cpu0001/shcach0000
machine/node0000/cpu0000
machine/node0000/cpu0000/shcach0000
machine/node0000/memory0000
machine/node0000/memory0000/memsize
[guest@trillium]$ cat machine/node0001/memory0001/memsize
8
^^^^^ this is obviously buggy!

As you can see, the devices and internode-distance parts are still to be
done.

I am adding more of the ATLAS folks to the cc list.

Kanoj
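P.S. For anyone who wants to poke at the tree from user space before
reading the patch, here is a quick, untested sketch that just walks
/proc/machine recursively (it assumes nothing beyond the layout shown in
the find output above) and prints what it finds, including the memsize
values:

#include <stdio.h>
#include <string.h>
#include <dirent.h>

/* Recursively list everything under 'path', indented by depth.
 * If an entry is not a directory and its name contains "memsize",
 * read and print the value stored in it. */
static void walk(const char *path, int depth)
{
	DIR *dir = opendir(path);
	struct dirent *de;
	char child[512];

	if (!dir) {
		/* Leaf file: only memsize has readable contents so far. */
		if (strstr(path, "memsize")) {
			FILE *fp = fopen(path, "r");
			long val;

			if (fp && fscanf(fp, "%ld", &val) == 1)
				printf("%*s(memsize = %ld)\n", depth * 2, "", val);
			if (fp)
				fclose(fp);
		}
		return;
	}
	while ((de = readdir(dir)) != NULL) {
		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		printf("%*s%s\n", depth * 2, "", de->d_name);
		snprintf(child, sizeof(child), "%s/%s", path, de->d_name);
		walk(child, depth + 1);
	}
	closedir(dir);
}

int main(void)
{
	walk("/proc/machine", 0);
	return 0;
}

It is just a recursive readdir, so it should work on whatever graph your
platform ends up creating, with no dependence on the patch internals.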