To: Paul McKenney/Beaverton/IBM@IBMUS
cc: andi@suse.de, andrea@suse.de, Hubertus Franke/Watson/IBM@IBMUS, kanoj@engr.sgi.com, bjorn_helgaas@hp.com, jwright@engr.sgi.com, sunil.saxena@intel.com, Jerry.Harrow@Compaq.com, kumon@flab.fujitsu.co.jp, aono@ed2.com1.fc.nec.co.jp, suganuma@hpc.bs1.fc.nec.co.jp, woodman@missioncriticallinux.com, norton@mclinux.com, beckman@turbolabs.com, tbutler@hi.com
Date: 03/28/01 03:57 PM
From: Kanoj Sarcar
Subject: Re: your mail
> o Memory-allocation policies (e.g., DISCONTIGMEM, first-touch
> allocation,
> node-specific page allocation, node-specific bootmem allocation,
> read-only mapping replication, per-node kmalloc(), per-node
> kswapd/kreclaimd. Status: mostly done in 2.4
Just to be clear: _kernel_ read-only replication has been implemented
on some architectures. Per-node kmalloc() has not been, and volunteers
in this area are needed. We also need to give architecture code the
option of asking for per-node kswapd/kreclaimd (do not assume all NUMA
machines will want per-node daemons; some might want one daemon per
four nodes, etc.); this is 2.5 work.
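A minimal sketch of what I have in mind, all hypothetical:
NODES_PER_KSWAPD and the node argument to kswapd() do not exist
today; they just illustrate letting the platform decide how many
nodes each daemon covers.

#ifndef NODES_PER_KSWAPD
#define NODES_PER_KSWAPD 1      /* platform may override, e.g. 4 */
#endif

static int __init kswapd_init(void)
{
        int node;

        /* One daemon per group of NODES_PER_KSWAPD nodes (sketch). */
        for (node = 0; node < numnodes; node += NODES_PER_KSWAPD)
                kernel_thread(kswapd, (void *)(long)node,
                              CLONE_FS | CLONE_FILES);
        return 0;
}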
>
> o Hierarchical scheduling policies. Status: some patches available.
> Urgency: high.
What is hierarchical scheduling? Handling multilevel caches, nested
NUMA, etc.? If so, as I mentioned, some progress has been made in the
sense that Linus acknowledges the issue and is open to a per-arch
scheduling solution, i.e., goodness() will be replaced with
arch_goodness() etc.
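As a rough illustration of the kind of hook being discussed
(arch_goodness() does not exist yet, and the extra weight parameter
and default below are made up), the generic scheduler could keep its
current calculation and simply let the architecture adjust the result:

/*
 * Sketch only: a NUMA architecture could, for example, subtract a
 * penalty from the generic weight when p's memory lives on a
 * different node than this_cpu.
 */
#ifndef arch_goodness
#define arch_goodness(p, this_cpu, this_mm, weight)     (weight)
#endif

static inline int sched_goodness(struct task_struct *p, int this_cpu,
                                 struct mm_struct *this_mm)
{
        int weight = goodness(p, this_cpu, this_mm);    /* existing code */

        return arch_goodness(p, this_cpu, this_mm, weight);
}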
> o Multipath I/O. Status: efforts in progress. Urgency: low to
> moderate,
> depending on architecture.
It wasn't clear to me what the relationship between multipath I/O
and NUMA is. True, if a user program wants to allocate memory near
a device, it has more choices if the device is multipath. Are there
other relationships I am missing?
> o Support for nested NUMA (e.g., nodes composed of several multi-CPU
> chips). Status: Not yet started. Urgency: moderate, depending
> on architecture.
Only scheduling optimizations here, right?
> o Support for non-tree topologies. There were some comments about
> this possibility yesterday. Does anyone wish to stand up and be
> counted? It does add some complexity, so do we really want to
> support it?
Hmm, IRIX's /hw, i.e., hwgfs (the hardware-graph file system), is a
graph. This is because it represents the interconnect between nodes.
I remember graphs and trees being mentioned while talking about /proc
extensions; is that what you are talking about here?
>
> Category 2: Facilities to bind tasks and memory to CPUs and nodes
>
> o Topology discovery (see below for some approaches).
> Status: Not yet started. Urgency: high.
I suspect this is completely architecture/platform specific. The
generic NUMA code will expect a distance matrix distances[N][N],
which the platform code has to populate. Device-to-node distance is
rarely asked for, so it is okay to have something like the procedure
call you talk about later.
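A rough sketch of what I mean (MAX_NUMNODES and numa_set_distance()
are placeholder names for illustration, not existing kernel symbols):

/*
 * Sketch only: the generic NUMA code owns the matrix; platform code
 * fills it in from its boot-time setup.
 */
#define MAX_NUMNODES 64                          /* assumed upper bound */

unsigned char distances[MAX_NUMNODES][MAX_NUMNODES];

void __init numa_set_distance(int from, int to, unsigned char dist)
{
        distances[from][to] = dist;
}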
>
> o Binding tasks to CPUs or groups of CPUs (export runon/cpuset
> functionality). Status: some patches available. Urgency: high.
>
> o Binding tasks and related groups of tasks to nodes or groups
> of nodes. Status: not yet started. Urgency: high.
>
> o Binding memory to nodes or groups of nodes. Status: not yet started.
> Urgency: high.
I think I covered some reasons why SGI thinks the above are needed:
basically, soft partitioning (to reduce the impact of one set of
programs on another) and more repeatability in performance. Do others
have other reasons for these features?
Both repeatability and partitioning can be addressed by nodesets
(which I listed as an advanced option on the slide but never got
around to talking about), which I _think_ are easier to understand.
Of course, cpusets/memory placement gives more detailed control.
Whether NUMA uses cpusets will also be dictated at least partly by
whether cpusets become part of SMP kernels.
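To make the nodeset idea concrete, here is a purely hypothetical
sketch (none of these names exist today): a nodeset is just a
per-task bitmask of allowed nodes that both the scheduler and the
page allocator consult.

typedef unsigned long nodeset_t;        /* one bit per node (sketch) */

#define NODE_IN_SET(ns, node)   (((ns) >> (node)) & 1UL)

/*
 * The scheduler would skip CPUs whose node is not in the task's
 * nodeset, and the page allocator would restrict its node fallback
 * list the same way, giving both soft partitioning and repeatable
 * placement.
 */
static inline int task_allowed_on_node(nodeset_t ns, int node)
{
        return NODE_IN_SET(ns, node);
}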
>
> o General specification of "hints" that indicate relationships
> between different objects, including tasks, a given task's
> data/bss/stack/heap ("process" to the proprietary-Unix people
> among us), a given shm/mmap area, a given device,
> a given CPU, a given node, and a given IPC conduit.
> Status: Needs definition, research, prototyping, and
> experimentation. Urgency: Hopefully reasonably low,
> given the complexities that would be involved if not done
> correctly.
I remember we talked about kernel-to-user copying, but we decided it
was a non-issue, right? The thread could just ask to be close to the
device.
I also remember a thread wanting to be close to an IPC conduit; was
the example that two threads were communicating via pipes, and both
should be near the kernel-maintained buffer for the pipe? Or was it
something else? I thought we decided this could be handled by asking
the kernel to schedule the threads on the same node, so that
(hopefully) the kernel buffer would be on the same node.
Is there a reason other than the above for two threads to want to be
on the same CPU (I know IBM has the concept of sharing cache affinity
between multiple threads)? Or on the same node (forgetting node-level
caches, local snoops, etc.)?
>
> So where do you get the distance information from? I believe that
> this must be supplied by architecture-specific functions: on some
> systems, the firmware might supply this information, while on others,
> the architecture-specific Linux code will likely have to make
> assumptions based on the system structure. Would an API like the
> following be appropriate?
>
> int numa_distance(char *obj1, char *obj2)
>
> where obj1 and obj2 are the names of the CPUs or memories for which
> distance information is desired.
It is much simpler to have a matrix populated at boot time. (Let's
not worry right now about hot plug; we can move to a procedure call
if hot plug demands it.)
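For what it is worth, the string-based call you propose could still
be offered as a thin wrapper over such a matrix. A sketch, assuming a
"cpuN"/"memN" naming convention and a cpu_to_node() arch helper (both
assumptions):

/*
 * Sketch only: map object names onto node ids, then index the
 * boot-time distances[][] matrix shown earlier.
 */
static int name_to_node(const char *obj)
{
        int id = 0;
        const char *p;

        if (strncmp(obj, "cpu", 3) != 0 && strncmp(obj, "mem", 3) != 0)
                return -1;
        for (p = obj + 3; *p >= '0' && *p <= '9'; p++)
                id = id * 10 + (*p - '0');
        return (obj[0] == 'c') ? cpu_to_node(id) : id;
}

int numa_distance(char *obj1, char *obj2)
{
        int n1 = name_to_node(obj1), n2 = name_to_node(obj2);

        if (n1 < 0 || n2 < 0)
                return -1;
        return distances[n1][n2];
}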
>
> All these proposals leave open the question of how to represent
> distance information from memory and CPU to devices.
Chances are, we are not going to use device distances. If the user
wants to allocate memory near device D1, the kernel will look at the
device handle (what this will end up being is an open issue, i.e., is
it an ASCII name, a devfs handle, a pci_dev *, etc.), figure out the
controlling/master node (with something like
master_node(void *device_handle)), and try to allocate memory on that
node (falling back on the next best node based on inter-node
distances). Similarly for a user wanting to be on a CPU close to the
device.
If it really came to needing device distances, I would suggest something
like
device_distance(void *device_handle, char *kvaddr)
to find the distance from a device to some piece of memory.
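Here is a sketch of how the allocation path might look. master_node()
is just the name suggested above; nth_closest_node() and
alloc_pages_node() are assumed helpers meaning "the i-th node by
distance from the target" and "allocate from this node", respectively.

/*
 * Sketch only: allocate near a device by starting at its master node
 * and walking the remaining nodes in order of increasing distance
 * (as given by the distances[][] matrix).
 */
struct page *alloc_pages_near_device(void *device_handle,
                                     unsigned int gfp_mask,
                                     unsigned int order)
{
        int target = master_node(device_handle);
        int i;

        for (i = 0; i < numnodes; i++) {
                int node = nth_closest_node(target, i);
                struct page *page = alloc_pages_node(node, gfp_mask, order);

                if (page)
                        return page;
        }
        return NULL;
}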
This raises an interesting assumption that I am taking for granted:
all ccNUMA machines use the same bus/network/switches for PIO, DMA,
and internode memory-transfer paths. So the PIO distance from a CPU
to a device would be the same as the DMA distance from the containing
node to the device, which in turn equals the memory distance from any
CPU on the containing node to the master node of the device. Is there
an architecture that does not fit this simplified picture?
>
> Can we simplify this? In most architectures I have dealt with, you can
> do pretty well just by counting "hops" through nodes or interconnects.
Yes, that was my expectation too. Platform code is perfectly welcome
to set distances[i][j] = 1.6 for one hop, 4.7 for two hops, etc.,
though. For most platforms, 1 for one hop and 2 for two hops would
work fine, I suspect.
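Concretely, the per-platform init could be as simple as this sketch,
where hops() stands for whatever topology knowledge the platform has
(e.g., Manhattan distance between node coordinates on a mesh):

/*
 * Sketch only: fill the boot-time matrix purely from hop counts,
 * using the numa_set_distance() placeholder from earlier.
 */
void __init numa_init_distances(void)
{
        int i, j;

        for (i = 0; i < numnodes; i++)
                for (j = 0; j < numnodes; j++)
                        numa_set_distance(i, j, hops(i, j));
}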
Kanoj