To: Paul McKenney/Beaverton/IBM@IBMUS
cc: andi@suse.de, andrea@suse.de, aono@ed2.com1.fc.nec.co.jp, beckman@turbolabs.com, bjorn_helgaas@hp.com, Hubertus Franke/Watson/IBM@IBMUS, Jerry.Harrow@Compaq.com, jwright@engr.sgi.com, kanoj@engr.sgi.com, kumon@flab.fujitsu.co.jp, norton@mclinux.com, suganuma@hpc.bs1.fc.nec.co.jp, sunil.saxena@intel.com, tbutler@hi.com, woodman@missioncriticallinux.com, mikek@sequent.com, Kenneth Rozendal/Austin/IBM@IBMUS
Date: 03/29/01 09:26 PM
From: Kimio Suganuma
Subject: Re: NUMA-on-Linux roadmap, version 2
Hi all,
I feel this topology descriptor might also be useful
for other purposes: for example, hot-plugging,
dynamic partitioning, MCA handling, and so on.
Do you think this /proc information should be generic
or NUMA-specific?
Kimi
> Topology Discovery
>
> o Methods from prior NUMA OSes:
>
> o SGI's /hw hardware graph filesystem.
>
> o ptx's TMP_* commands to the tmp_affinity() system call.
>
> o Compaq's "foreach()" library functions (see table 2-2 of
> the .pdf file).
>
> o AIX's rs_numrads(), rs_getrad(), rs_getinfo(), and
> rs_getlatency() library functions.
>
> o Suggestions from meeting involving extensions to /proc filesystem:
>
> 1. /proc/nodeinfo "flat file" with tags similar to /proc/cpuinfo.
> For example:
>
> node: 0
> PCI busses: 0 1
> PCI cards: ...
>
> A hierarchical NUMA architecture could be handled in this
> scheme by using dot-separated node numbers like this:
>
> node: 1.0
>
> which would indicate the zeroth node within node 1.
>
> 2. /proc/nodeinfo containing relationship information, one
> "parent" entity per line. For example:
>
> toplevel: node0 node1 node2
> node0: cpu0 cpu1 cpu2 cpu3 mem0 pcibus0 pcibus1
> pcibus0: dev0 otherdev0
>
> Again, dot-separated numbers could be used to specify
> hierarchical NUMA architectures:
>
> node1: node1.0 node1.1 node1.2 node1.3
>
> 3. /proc/nodeinfo containing relationship information, one
> relationship per line. This allows each relationship to be
> tagged with attributes (memory size, bus speed, "distance"
> estimates, and so on), and it allows non-tree architectures
> to be specified easily. (A parsing sketch for this format
> follows the list.) For example:
>
> toplevel: node0
> toplevel: node1
> toplevel: node2
> node0: mem0 512m
> node0: cpu0 cpu1:1 cpu2:1 cpu3:1 mem0:1 cpu4:2 # object:dist
> node0: cpu1 cpu0:1 cpu2:1 cpu3:1 mem0:1 cpu4:2 # object:dist
> node0: cpu2 cpu0:1 cpu1:1 cpu3:1 mem0:1 cpu4:2 # object:dist
> node0: cpu3 cpu0:1 cpu1:1 cpu2:1 mem0:1 cpu4:2 # object:dist
> node0: pci0 66mhz
> pci0: eth0 1gb
>
> The record for each CPU includes the "distances" to each
> other CPU and to each memory.
>
> Hierarchical NUMA architectures would just indicate
> the relationships:
>
> node1: node1.0
>
> 4. /proc/node containing a directory structure describing the
> system architecture. There would be a directory for each
> node, which could contain subordinate directories for
> subordinate nodes, as well as a file describing each resource
> contained in the given node. In this case, we don't need
> to prefix each node number with "node", since the presence
> of a directory already identifies it as a node.
>
> The example above would have /proc/node containing directories
> for each node, named "0", "1", and so on. These per-node
> directories could themselves contain directories named
> "0.0", "0.1", and so on to indicate nested NUMA nodes.
> The directory /proc/node/0 would contain files cpu0, cpu1,
> cpu2, and cpu3, and these files would contain distance
> records to each other cpu and to each memory. For example,
> /proc/node/0/cpu0 might contain:
>
> cpu1 1
> cpu2 1
> cpu3 1
> mem0 1
> cpu4 2
>
> Non-tree architectures (e.g., the ring example discussed
> yesterday) could use symbolic links or circular paths as
> needed.
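> 
> As a rough illustration of how simple format 3 would be to
> consume, here is a sketch of a user-space reader (the file name
> and record layout are just the proposal above, not a committed
> interface):
> 
>     /* Sketch: read the one-relationship-per-line format (3). */
>     #include <stdio.h>
>     #include <string.h>
> 
>     int main(void)
>     {
>         char line[256];
>         FILE *fp = fopen("/proc/nodeinfo", "r");
> 
>         if (fp == NULL)
>             return 1;
>         while (fgets(line, sizeof(line), fp) != NULL) {
>             char *comment = strchr(line, '#');
>             char *parent, *tok;
> 
>             if (comment != NULL)
>                 *comment = '\0';            /* strip "# object:dist" */
>             parent = strtok(line, ": \t\n");
>             if (parent == NULL)
>                 continue;                   /* blank line */
>             while ((tok = strtok(NULL, " \t\n")) != NULL) {
>                 char *dist = strchr(tok, ':');
> 
>                 if (dist != NULL)
>                     *dist++ = '\0';         /* split "cpu4:2" */
>                 printf("%s -> %s dist %s\n",
>                        parent, tok, dist ? dist : "-");
>             }
>         }
>         fclose(fp);
>         return 0;
>     }
> 
> (In this sketch, attribute tokens such as "512m" and the record's
> own object name come out as plain children with no distance; a
> real reader would classify them.)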
>
> In all these examples, the kernel would need to maintain the topology
> information in some convenient form: a linked structure, for example.
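> 
> Purely as a sketch (none of these names exist in the kernel
> today), such a linked structure might look like:
> 
>     /* Hypothetical in-kernel topology structure. */
>     struct topo_node {
>         char name[16];              /* "node0", "cpu3", "mem0", ... */
>         struct topo_node *parent;   /* enclosing node; NULL at top level */
>         struct topo_node *children; /* first contained resource */
>         struct topo_node *sibling;  /* next resource in the same parent */
>         int *distance;              /* per-object distance row, or NULL */
>     };
> 
> The /proc representations above would then just be different
> walks over this structure.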
>
> So where do you get the distance information from? I believe that
> this must be supplied by architecture-specific functions: on some
> systems, the firmware might supply this information, while on others,
> the architecture-specific Linux code will likely have to make
> assumptions based on the system structure. Would an API like the
> following be appropriate?
>
> int numa_distance(char *obj1, char *obj2)
>
> where obj1 and obj2 are the names of the CPUs or memories for which
> distance information is desired.
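> 
> The generic side of that API could be little more than a table
> lookup, with the table filled in at boot by architecture-specific
> code (all names below are hypothetical):
> 
>     #include <string.h>
> 
>     #define MAX_OBJS 64
> 
>     /* Filled in at boot by architecture-specific code, from
>      * firmware tables where available, otherwise from assumptions
>      * about the system structure.  Sketch only. */
>     static char obj_name[MAX_OBJS][16];
>     static int nr_objs;
>     static int obj_dist[MAX_OBJS][MAX_OBJS];
> 
>     static int obj_index(const char *obj)
>     {
>         int i;
> 
>         for (i = 0; i < nr_objs; i++)
>             if (strcmp(obj_name[i], obj) == 0)
>                 return i;
>         return -1;
>     }
> 
>     int numa_distance(char *obj1, char *obj2)
>     {
>         int i = obj_index(obj1);
>         int j = obj_index(obj2);
> 
>         if (i < 0 || j < 0)
>             return -1;          /* unknown object name */
>         return obj_dist[i][j];
>     }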
>
> All these proposals leave open the question of how to represent
> distance information from memory and CPU to devices.
>
> Can we simplify this? In most architectures I have dealt with, you can
> do pretty well just by counting "hops" through nodes or interconnects.
> I have a hard time believing that any software will go to the trouble
> of handling the full n-squared distance metrics without also going to
> the trouble of simply measuring the bandwidths and latencies.
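> 
> Hop counting is also cheap to compute from the topology graph; a
> breadth-first walk over a (hypothetical) adjacency table is all it
> takes:
> 
>     #define MAX_OBJS 64
> 
>     /* adjacency[i][j] is 1 if objects i and j are directly linked;
>      * filled in by architecture-specific code.  Sketch only. */
>     static int adjacency[MAX_OBJS][MAX_OBJS];
>     static int nr_objs;
> 
>     /* Distance as the number of hops between two objects. */
>     int hop_distance(int from, int to)
>     {
>         int dist[MAX_OBJS], queue[MAX_OBJS];
>         int head = 0, tail = 0, i;
> 
>         for (i = 0; i < nr_objs; i++)
>             dist[i] = -1;                   /* unvisited */
>         dist[from] = 0;
>         queue[tail++] = from;
>         while (head < tail) {
>             int cur = queue[head++];
> 
>             if (cur == to)
>                 return dist[cur];
>             for (i = 0; i < nr_objs; i++)
>                 if (adjacency[cur][i] && dist[i] < 0) {
>                     dist[i] = dist[cur] + 1;
>                     queue[tail++] = i;
>                 }
>         }
>         return -1;                          /* unreachable */
>     }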
>
> Thoughts? Other approaches?
>
> Thanx, Paul
>
--
Kimio Suganuma