To: Paul McKenney/Beaverton/IBM@IBMUS, tony.luck@intel.com, kouchi@hpc.bs1.fc.nec.co.jp, asit.k.mallick@intel.com, jenna.s.hall@intel.com
Cc: andi@suse.de, andrea@suse.de, aono@ed2.com1.fc.nec.co.jp, beckman@turbolabs.com, bjorn_helgaas@hp.com, Hubertus Franke/Watson/IBM@IBMUS, Jerry.Harrow@Compaq.com, jwright@engr.sgi.com, kanoj@engr.sgi.com, Kenneth Rozendal/Austin/IBM@IBMUS, kumon@flab.fujitsu.co.jp, mikek@sequent.com, norton@mclinux.com, suganuma@hpc.bs1.fc.nec.co.jp, sunil.saxena@intel.com, tbutler@hi.com, woodman@missioncriticallinux.com
From: kanoj@google.engr.sgi.com
Date: 04/05/01 01:38 AM
Subject: Re: NUMA-on-Linux roadmap, version 2
> o Suggestions from meeting involving extensions to /proc filesystem:
>
> 1. /proc/nodeinfo "flat file" with tags similar to /proc/cpuinfo.
> For example:
>
> node: 0
> PCI busses: 0 1
> PCI cards: ...
>
> A hierarchical NUMA architecture could be handled in this
> scheme by using dot-separated node numbers like this:
>
> node: 1.0
>
> which would indicate the zeroth node within node 1.
>
> 2. /proc/nodeinfo containing relationship information, one
> "parent" entity per line. For example:
>
> toplevel: node0 node1 node2
> node0: cpu0 cpu1 cpu2 cpu3 mem0 pcibus0 pcibus1
> pcibus0: dev0 otherdev0
>
> Again, dot-separated numbers could be used to specify
> hierarchical NUMA architectures:
>
> node1: node1.0 node1.1 node1.2 node1.3
>
> 3. /proc/nodeinfo containing relationship information, one
> relationship per line. This allows better tagging with
> attributes (for example, memory size, bus speed, "distance"
> estimates, etc.). It also allows non-tree architectures
> to be specified easily. For example:
>
> toplevel: node0
> toplevel: node1
> toplevel: node2
> node0: mem0 512m
> node0: cpu0 cpu1:1 cpu2:1 cpu3:1 mem0:1 cpu4:2 # object:dist
> node0: cpu1 cpu0:1 cpu2:1 cpu3:1 mem0:1 cpu4:2 # object:dist
> node0: cpu2 cpu0:1 cpu1:1 cpu3:1 mem0:1 cpu4:2 # object:dist
> node0: cpu3 cpu0:1 cpu1:1 cpu2:1 mem0:1 cpu4:2 # object:dist
> node0: pci0 66mhz
> pci0: eth0 1gb
>
> The record for each CPU includes the "distances" to each
> other CPU and to each memory.
>
> Hierarchical NUMA architectures would just indicate
> the relationships:
>
> node1: node1.0
>
> 4. /proc/node containing a directory structure describing the
> system architecture. There would be a directory for each
> node, which could contain subordinate directories for
> subordinate nodes, as well as a file describing each resource
> contained in the given node. In this case, we don't need
> to prefix each node number with "node", since the fact that
> there is a directory determines that it must be a node.
>
> The example above would have /proc/node containing directories
> for each node, named "0", "1", and so on. These per-node
> directories could themselves contain directories named
> "0.0", "0.1", and so on to indicate nested NUMA nodes.
> The directory /proc/node/0 would contain files cpu0, cpu1,
> cpu2, and cpu3, and these files would contain distance
> records to each other cpu and to each memory. For example,
> /proc/node/0/cpu0 might contain:
>
> cpu1 1
> cpu2 1
> cpu3 1
> mem0 1
> cpu4 2
>
> Non-tree architectures (e.g., the ring example discussed
> yesterday) could use symbolic links or circular paths as
> needed.
>
Okay, I decided to prototype the last option. The reason is that it gives
the user a way to reference chipset registers, it is the most extensible,
and it handles shared resources (caches) the best.
The naming I finally came up with is a bit different from what that option
suggests (mainly because I implemented first, then looked at which option
I had implemented!), but I wouldn't have any issues changing it to match
the option exactly.
The patch is at
http://oss.sgi.com/projects/numa/download/naming242.patch
Most importantly, browse the topology.h file, then look at the
implementation in topology.c if you have the patience.
Btw, the implementation currently uses /proc, but new uses of /proc
are strongly discouraged; devfs is probably a better idea, and a new fs is
also possible. In any case, naming implementations are going to change a lot
in 2.5, so discussions about the underlying implementation are probably
not that worthwhile. This also means that the kernel interfaces are open
to change (e.g., I don't like topo_shared_cache_add(): it assumes exactly
two elements are passed in, but what if more components share a
resource? Also, the implementation should probably use links, but procfs
is lacking in this respect ... alternate link-like solutions are possible).
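To illustrate the kind of change I have in mind for the shared-resource
interface, something along these lines would let any number of components
share one resource instead of exactly two. This is just a sketch; none of
the names below are from the patch:

#include <stdlib.h>

/* Illustrative only: these structures are not what the patch defines. */
struct topo_component {
	const char *name;		/* e.g. "cpu0006" */
};

struct topo_shared_res {
	const char *name;		/* e.g. "shcach0003" */
	int nsharers;
	struct topo_component **sharers;
};

/*
 * Register an arbitrary number of components as sharers of one
 * resource, instead of assuming exactly two the way
 * topo_shared_cache_add() currently does.
 */
int topo_shared_res_add(struct topo_shared_res *res,
			struct topo_component **comps, int ncomps)
{
	int i;

	res->sharers = malloc(ncomps * sizeof(*res->sharers));
	if (res->sharers == NULL)
		return -1;
	for (i = 0; i < ncomps; i++)
		res->sharers[i] = comps[i];
	res->nsharers = ncomps;
	return 0;
}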
Feel free to make suggestions. I tested this out on a mips64 platform,
but you can really create the graph on any platform, just to see how
it looks. Test output:
[guest@trillium]$ cd /proc
[guest@trillium]$ find machine -name "*" -print
machine
machine/node0003
machine/node0003/cpu0007
machine/node0003/cpu0007/shcach0003
machine/node0003/cpu0006
machine/node0003/cpu0006/shcach0003
machine/node0003/memory0003
machine/node0003/memory0003/memsize
machine/node0002
machine/node0002/cpu0005
machine/node0002/cpu0005/shcach0002
machine/node0002/cpu0004
machine/node0002/cpu0004/shcach0002
machine/node0002/memory0002
machine/node0002/memory0002/memsize
machine/node0001
machine/node0001/cpu0003
machine/node0001/cpu0003/shcach0001
machine/node0001/cpu0002
machine/node0001/cpu0002/shcach0001
machine/node0001/memory0001
machine/node0001/memory0001/memsize
machine/node0000
machine/node0000/cpu0001
machine/node0000/cpu0001/shcach0000
machine/node0000/cpu0000
machine/node0000/cpu0000/shcach0000
machine/node0000/memory0000
machine/node0000/memory0000/memsize
[guest@trillium]$ cat machine/node0001/memory0001/memsize
8
^^^^^ this is obviously buggy!
As you can see, the devices and the internode-distance parts are still to
be done.
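If you want to poke at the tree from user space, something along these
lines (an untested sketch, assuming the /proc/machine root shown above)
would roughly reproduce the find(1) listing:

#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

/* Recursively print every entry under the topology root, much like
 * "find machine -print" above. */
static void walk(const char *path)
{
	DIR *dir;
	struct dirent *de;
	struct stat st;
	char child[1024];

	printf("%s\n", path);
	dir = opendir(path);
	if (dir == NULL)
		return;
	while ((de = readdir(dir)) != NULL) {
		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		snprintf(child, sizeof(child), "%s/%s", path, de->d_name);
		if (stat(child, &st) == 0 && S_ISDIR(st.st_mode))
			walk(child);	/* descend into nodes, cpus, ... */
		else
			printf("%s\n", child);	/* leaf file, e.g. memsize */
	}
	closedir(dir);
}

int main(void)
{
	walk("/proc/machine");
	return 0;
}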
I am adding more of the ATLAS folks to the cc list.
Kanoj