To: andi@suse.de, andrea@suse.de, Hubertus Franke/Watson/IBM, kanoj@engr.sgi.com, bjorn_helgaas@hp.com, jwright@engr.sgi.com, sunil.saxena@intel.com, Jerry.Harrow@Compaq.com, kumon@flab.fujitsu.co.jp, aono@ed2.com1.fc.nec.co.jp, suganuma@hpc.bs1.fc.nec.co.jp, woodman@missioncriticallinux.com, norton@mclinux.com, beckman@turbolabs.com, tbutler@hi.com
Date: 03/27/01 11:34 PM
From: Paul McKenney
Subject:
Hello!
Thank you all for contributing to yesterday's NUMA-on-Linux get-together!
I believe we made good progress, and look forward to continuing to work
with you all.
I have received a few presentations; if you have not already done so, please
send me a softcopy of your presentation so that I can post it on SourceForge.
At the end of the meeting, I said I would send out a list of the NUMA-related
tasks grouped into three categories: (1) make Linux's default handling of NUMA
work well, (2) facilities to bind tasks and memory to CPUs and nodes, and
(3) facilities to provide more general scheduling constraints and hints.
This is -not- a priority list, nor is it a schedule. It is instead simply
a way of grouping the tasks to help understand how they relate.
I have tagged each item with a swag at its current status and urgency.
Although I based these swags on yesterday's conversations and presentations,
I am sure that a number of them could use some adjustment, and I know you
guys will not be shy. ;-)
Category 1: Make Linux's default handling of NUMA work well
o Memory-allocation policies (e.g., DISCONTIGMEM, first-touch allocation,
node-specific page allocation, node-specific bootmem allocation,
read-only mapping replication, per-node kmalloc(), per-node
kswapd/kreclaimd). Status: mostly done in 2.4.
o Hierarchical scheduling policies. Status: some patches available.
Urgency: high.
o NUMAization of internal kernel data structures and algorithms,
as indicated by benchmarking and workload experiences.
Status: Not yet started. Urgency: TBD based on benchmark data.
o Multipath I/O. Status: efforts in progress. Urgency: low to moderate,
depending on architecture.
o Migration (has been helpful on some, but not all, architectures).
Status: Not yet started. Urgency: low to moderate, depending on
architecture.
o Support for nested NUMA (e.g., nodes composed of several multi-CPU
chips). Status: Not yet started. Urgency: moderate, depending
on architecture.
o Support for non-tree topologies. There were some comments about
this possibility yesterday. Does anyone wish to stand up and be
counted? It does add some complexity, so do we really want to
support it?
Category 2: Facilities to bind tasks and memory to CPUs and nodes
o Topology discovery (see below for some approaches).
Status: Not yet started. Urgency: high.
o Binding tasks to CPUs or groups of CPUs (export runon/cpuset
functionality). Status: some patches available. Urgency: high.
o Binding tasks and related groups of tasks to nodes or groups
of nodes. Status: not yet started. Urgency: high.
o Binding memory to nodes or groups of nodes. Status: not yet started.
Urgency: high.
Category 3: Facilities to provide more general scheduling constraints and hints
o Facilities that have seen real use:
o Specification of a portion of a machine (e.g., IRIX's
mldset_place()). Status: not yet started.
Urgency: High for scientific workloads sharing a machine(?),
low for other workloads (e.g., commercial workloads have
tended to use some form of software partitioning to get this
effect).
o Hints to link tasks and groups of tasks to each other
(e.g., ptx's attach_proc() or AIX's ra_attach()).
Status: not yet started. Urgency: High for commercial
workloads, unknown for other workloads.
o Facilities that seem compelling, but which have not yet seen use:
o "Virtualization" of CPU and node IDs so that multiple apps
can believe that they are scheduling on CPU 0.
Status: not yet started. Urgency: Hopefully reasonably
low, given the sharply differing views on how to go about
doing this. ;-)
o General specification of "hints" that indicate relationships
between different objects, including tasks, a given task's
data/bss/stack/heap ("process" to the proprietary-Unix people
among us), a given shm/mmap area, a given device,
a given CPU, a given node, and a given IPC conduit.
Status: Needs definition, research, prototyping, and
experimentation. Urgency: Hopefully reasonably low,
given the complexities that would be involved if not done
correctly.
Topology Discovery
o Methods from prior NUMA OSes:
o SGI's /hardware (sp?).
o ptx's TMP_* commands to the tmp_affinity() system call.
o Compaq's "foreach()" library functions (see table 2-2 of
the .pdf file).
o AIX's rs_numrads(), rs_getrad(), rs_getinfo(), and
rs_getlatency() library functions.
o Suggestions from meeting involving extensions to /proc filesystem:
1. /proc/nodeinfo "flat file" with tags similar to /proc/cpuinfo.
For example:
    node: 0
    PCI busses: 0 1
    PCI cards: ...
A hierarchical NUMA architecture could be handled in this
scheme by using dot-separated node numbers like this:
    node: 1.0
which would indicate the zeroth node within node 1.
2. /proc/nodeinfo containing relationship information, one
"parent" entity per line. For example:
    toplevel: node0 node1 node2
    node0: cpu0 cpu1 cpu2 cpu3 mem0 pcibus0 pcibus1
    pcibus0: dev0 otherdev0
Again, dot-separated numbers could be used to specify
hierarchical NUMA architectures:
    node1: node1.0 node1.1 node1.2 node1.3
3. /proc/nodeinfo containing relationship information, one
relationship per line. This allows better tagging with
attributes (for example, memory size, bus speed, "distance"
estimates, etc.). It also allows non-tree architectures
to be specified easily. For example:
    toplevel: node0
    toplevel: node1
    toplevel: node2
    node0: mem0 512m
    node0: cpu0 cpu1:1 cpu2:1 cpu3:1 mem0:1 cpu4:2 # object:dist
    node0: cpu1 cpu0:1 cpu2:1 cpu3:1 mem0:1 cpu4:2 # object:dist
    node0: cpu2 cpu0:1 cpu1:1 cpu3:1 mem0:1 cpu4:2 # object:dist
    node0: cpu3 cpu0:1 cpu1:1 cpu2:1 mem0:1 cpu4:2 # object:dist
    node0: pci0 66mhz
    pci0: eth0 1gb
The record for each CPU includes the "distances" to each
other CPU and to each memory.
Hierarchical NUMA architectures would just indicate
the relationships:
    node1: node1.0
4. /proc/node containing a directory structure describing the
system architecture. There would be a directory for each
node, which could contain subordinate directories for
subordinate nodes, as well as a file describing each resource
contained in the given node. In this case, we don't need
to prefix each node number with "node", since any entry that
appears as a directory is necessarily a node.
The example above would have /proc/node containing directories
for each node, named "0", "1", and so on. These per-node
directories could themselves contain directories named
"0.0", "0.1", and so on to indicate nested NUMA nodes.
The directory /proc/node/0 would contain files cpu0, cpu1,
cpu2, and cpu3, and these files would contain distance
records to each other cpu and to each memory. For example,
/proc/node/0/cpu0 might contain:
    cpu1 1
    cpu2 1
    cpu3 1
    mem0 1
    cpu4 2
Non-tree architectures (e.g., the ring example discussed
yesterday) could use symbolic links or circular paths as
needed.
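To make this last, directory-based layout concrete, here is a rough
userspace sketch that reads the hypothetical /proc/node/0/cpu0 file and
prints its distance records.  The path and format are just the strawman
from the example above, not anything that exists today.

    /* Userspace sketch: read the (hypothetical) /proc/node/0/cpu0 file
     * from option 4 and print each "object distance" record. */
    #include <stdio.h>

    int main(void)
    {
        FILE *fp = fopen("/proc/node/0/cpu0", "r");  /* strawman path */
        char obj[64];
        int dist;

        if (fp == NULL) {
            perror("/proc/node/0/cpu0");
            return 1;
        }
        while (fscanf(fp, "%63s %d", obj, &dist) == 2)
            printf("distance from cpu0 to %s: %d\n", obj, dist);
        fclose(fp);
        return 0;
    }

One nice property of the directory approach is that tools need no parser
beyond "name, whitespace, number", and shell scripts can get at the same
information with cat and awk.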
In all these examples, the kernel would need to maintain the topology
information in some convenient form: a linked structure, for example.
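For discussion's sake, that linked structure might look something like
the following.  All of the names here are invented, and the real thing
would presumably use the kernel's own list primitives.

    /* Strawman in-kernel topology structure -- names are invented. */
    enum numa_obj_type { NUMA_NODE, NUMA_CPU, NUMA_MEM, NUMA_BUS, NUMA_DEV };

    struct numa_obj {
        enum numa_obj_type type;    /* what kind of object this is */
        int id;                     /* e.g. CPU number or node number */
        struct numa_obj *parent;    /* enclosing node, NULL at top level */
        struct numa_obj *children;  /* first contained object, if any */
        struct numa_obj *sibling;   /* next object within the same parent */
    };

Any of the four /proc representations above could then be generated by
walking this structure; supporting non-tree topologies would simply mean
that it is a graph rather than a tree.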
So where do you get the distance information from? I believe that
this must be supplied by architecture-specific functions: on some
systems, the firmware might supply this information, while on others,
the architecture-specific Linux code will likely have to make
assumptions based on the system structure. Would an API like the
following be appropriate?
    int numa_distance(char *obj1, char *obj2)
where obj1 and obj2 are the names of the CPUs or memories for which
distance information is desired.
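If the architecture-specific code handed the generic code a distance
table, the generic part of such an API could be as simple as the sketch
below.  The object names and the numbers in the table are made up,
loosely following the option-3 example above, with distance 0 assumed
for an object to itself.

    /* Strawman numa_distance(): look the two objects up in a table
     * supplied by architecture-specific code (from firmware, or from
     * assumptions about the system structure).  All data here is
     * invented for illustration. */
    #include <string.h>

    #define NR_OBJS 6

    static const char *obj_name[NR_OBJS] = {
        "cpu0", "cpu1", "cpu2", "cpu3", "mem0", "cpu4"
    };

    static const int obj_dist[NR_OBJS][NR_OBJS] = {
        { 0, 1, 1, 1, 1, 2 },
        { 1, 0, 1, 1, 1, 2 },
        { 1, 1, 0, 1, 1, 2 },
        { 1, 1, 1, 0, 1, 2 },
        { 1, 1, 1, 1, 0, 2 },
        { 2, 2, 2, 2, 2, 0 },
    };

    static int obj_index(const char *obj)
    {
        int i;

        for (i = 0; i < NR_OBJS; i++)
            if (strcmp(obj, obj_name[i]) == 0)
                return i;
        return -1;
    }

    int numa_distance(char *obj1, char *obj2)
    {
        int i = obj_index(obj1);
        int j = obj_index(obj2);

        if (i < 0 || j < 0)
            return -1;          /* unknown object */
        return obj_dist[i][j];
    }

With the option-3 numbers, numa_distance("cpu0", "cpu4") would return 2
and numa_distance("cpu0", "mem0") would return 1.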
All these proposals leave open the question of how to represent
distance information from memory and CPU to devices.
Can we simplify this? In most architectures I have dealt with, you can
do pretty well just by counting "hops" through nodes or interconnects.
I have a really hard time believing that any software is going to go to
the trouble of really handling the n-squared distance metrics without also
going to the trouble of just measuring the bandwidths and latencies.
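As a strawman for the hop-counting approach, distance could simply be
the number of parent links crossed to reach the closest common ancestor
in the topology tree; again, the structure and names below are invented.

    /* Strawman hop count: distance between two objects is the number of
     * parent links followed to reach their closest common ancestor. */
    struct topo_obj {
        struct topo_obj *parent;    /* enclosing node, NULL at the top */
    };

    static int topo_depth(struct topo_obj *o)
    {
        int d = 0;

        while (o->parent != NULL) {
            o = o->parent;
            d++;
        }
        return d;
    }

    int hop_distance(struct topo_obj *a, struct topo_obj *b)
    {
        int da = topo_depth(a);
        int db = topo_depth(b);
        int hops = 0;

        while (da > db) {           /* bring a up to b's level */
            a = a->parent;
            da--;
            hops++;
        }
        while (db > da) {           /* bring b up to a's level */
            b = b->parent;
            db--;
            hops++;
        }
        while (a != b) {            /* climb together to common ancestor */
            a = a->parent;
            b = b->parent;
            hops += 2;
        }
        return hops;
    }

This gives only relative numbers, which I suspect is all most software
will ever look at; anything fancier can measure the bandwidths and
latencies for itself.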
Thoughts? Other approaches?
Thanx, Paul