Topology API
Version: 0.4
Date: October 25, 2002
Revision History
Versions 0.1 and 0.2 by Paul McKenney, IBM, 2001
Version 0.3 by Michael Hohnbaum, IBM, February 11, 2002
Version 0.4 by Matthew Dobson, IBM, September 6, 2002
Table of Contents
- General
- Include Files
- CPUs, Memory Blocks, and Nodes
- Definitions
- CPU
- Memory Block
- Node
- Numbering
- Simple Topology Discovery
- Future Extensions
General
This Topology API is intended to be a "simple" API that provides
rudimentary topology discovery of processors, memory blocks, (I/O
busses), and nodes. The API is defined in such a way that it is
architecture-agnostic and can (hopefully) be mapped onto any platform.
As such, it has only CPUs, Memory Blocks, and Nodes as basic building
blocks.
Include Files
The definitions are accessed via:
#include <asm/topology.h>
Each architecture (and sub-architecture) should define its own versions
of the functions specified below. For now, any architecture that has not
written arch-specific code for its topology.h file automatically uses
the asm-generic version, which implements generic, non-NUMA versions of
the calls.
CPUs, Memory Blocks, and Nodes
The purpose of this API is to create a useful, flexible, generic
topology infrastructure for the Linux kernel, and also to export that
infrastructure in a meaningful way to userspace. To that end, we try to
keep the assumptions about the underlying hardware to a minimum. As
mentioned above, this is done by specifying only the largest and most
important system components in the topology.
Definitions
CPU
This is a straightforward definition. A CPU in the topology
represents an actual, physical CPU in the system.
Memory Block
A Memory Block is defined to be a physically contiguous block of
memory. Memory Block is often written more concisely as memblk.
There has been much discussion about whether to allow multiple memblks
per node, enforce a strict 1-to-1 memblk-to-node mapping, allow nodes
to have only zero or one memblk, or do away with the memblk concept
entirely. Right now, the API assumes that there are zero or one memblks
per node. This is not hard-coded anywhere, so it could change in the
future if necessary. The CONFIG_NONLINEAR option may make dealing with
multiple memblks per node unnecessary.
Node
A Node in the Topology API is no more and no less than an abstract
container for other topology elements. The node often approximates
the 'physical node' building block that the underlying system may be
composed of, but I cannot emphasize this strongly enough: they are NOT
the same! The node is simply a container for CPUs and Memory Blocks
(and System Busses, and ...). The reason for this is that not all
architectures are composed of 'physical nodes' per se. Even among
platforms that are composed of 'physical nodes', the 'physical nodes'
of one platform may differ from those of another. For these reasons,
the node is used only as an abstract container type.
Numbering
In architectures that do not allow CPUs and nodes to be dynamically
added to a running system without a reboot, CPUs and nodes are numbered
consecutively from zero. Each node's CPUs are numbered consecutively.
Systems that can dynamically remove CPUs or nodes from a running system
may have "holes" in the numbering scheme. However, if new CPUs are
introduced, they will appear in the same range as other CPUs on the
same node, and if new nodes are introduced, their CPUs will be consecutively
numbered. CPUs from different nodes are never interleaved.
This means that if a node has the capacity to have additional CPUs added
to it, space must be left in the numbering scheme to accommodate those
additional CPUs.
A NUMA node may contain zero, one, or more memory blocks. As the
Linux kernel in general does not support multiple pgdats per node, the
topology does not explicitly support multiple memory blocks per node.
While a one-to-one relationship between memory blocks and NUMA nodes is
possible, it is not guaranteed, and NUMA implementations that do not
adhere to this relationship are expected to exist. Each memory
block has a distinct memory block number. The NUMA topology description
provides this numbering and the linkage between memory blocks and NUMA
nodes.
See the
rationale [update this too!] for example numberings for different
architectures.
Simple Topology Discovery
Userspace topology discovery is provided via driverfs. This
is being further developed by Matthew Dobson to provide a more complete
topology discovery and reporting mechanism. There are patches
to the kernel currently available [insert link], and the code is also
a part of Andrew Morton's experimental tree [link to Andrew's tree].
It is also useful to have a C-language API providing an efficient and
simple means of getting a few critical pieces of information. The following
functions are defined here as they are necessary for supporting a minimal
Topology API.
- int __cpu_to_node(int cpu);
Given a CPU number, return the number of the node containing that
CPU. If the architecture supports hierarchical NUMA (nodes containing
other nodes), then the lowest level node (i.e., the node most immediately
containing the cpu) is returned.
Returns a node number, or a negative errno if an error occurs.
- int __memblk_to_node(int memblk);
Given a memory block number, return the number of the node containing
that memblk. If the architecture supports hierarchical NUMA (nodes containing
other nodes), then the lowest level node (i.e., the node most immediately
containing the memblk) is returned.
Returns a node number, or a negative errno if an error occurs.
- int __parent_node(int node);
Given a node number, return the number of the parent node (i.e.,
the node that contains it). This is useful for hierarchical NUMA
machines which may have nested NUMA nodes.
Returns a node number, or a negative errno if an error occurs.
- unsigned long __node_to_cpu_mask(int node);
Given a node number, return a bitmask of the CPUs on that node.
This interface may soon be changed to take a pointer to a bitmask as
an additional argument, as there is motivation to allow more CPUs than
BITS_PER_LONG.
Returns a CPU bitmask. An empty bitmask may signify either a node
with no CPUs or an invalid node number.
- int __node_to_memblk(int node);
Given a node number, return the number of the first memory block
on that node.
Returns a memblk number, or a negative errno if an error occurs.
- int get_curr_cpu(void);
The currently executing CPU. This is similar to smp_processor_id(),
but will be available at user level.
Returns a CPU number.
- int get_curr_node(void);
The NUMA node containing the currently executing CPU.
Returns a node number.
Future Extensions
There are additional capabilities that could be implemented with
the Topology API. Some have been identified and are listed below:
- Functions to bind/restrict to a node and obtain node
binding/restriction information. This capability can be obtained
by using the CPU and memblk specific calls. [Insert link for MemBind
API]
- Virtual address or page to memory block function
(i.e., __va_to_memblk(), __page_to_memblk()).
- Adding various I/O (e.g., PCI) busses to the base elements of the
topology: __pcibus_to_node(). This allows for things like Multi-Path
I/O and intelligent bindings for I/O-intensive processes.