Process Scheduling and Memory Placement - Design Notes

Released to the public domain, by Paul Jackson, SGI, 2 July 2001.

Objective

Desired Properties

Vendor neutral base
Respect classic kernel-user split
Small, simple kernel patch
(Nearly) harmless for small systems
Sufficient, essential and minimal primitives for large systems

What notions do we have?

Physical processors or cpu's. Tasks are scheduled on cpu's. This does not include special purpose processors, such as dma engines, vector processors, graphics pipelines, routers and switches.
Banks of physical memory or mem's. On a NUMA system, banks are the equivalence classes of main memory locations under the relation of distance from the cpu's. On a typical SMP or UP system, all memory is the same distance from any cpu (same speed), and equivalent for our purposes. Though not necessarily, as it is possible even on a small UP system to have multiple main memory banks, of differing speeds, which would count here as multiple mem's. We do not include here special purpose memory, such as I/O and video frame buffers, caches, peripheral registers and I/O ports.

What we want to do with these notions?

What notion are we missing?

Attempts to put cpusets, nodesets, topoligies, various scheduling policies, quads, and such into the kernel are typically trying to place too much complexity, or too much policy, or perhaps too vendor specified details, into the kernel.

The cpus_allowed bit vector (added by Ingo for the Tux kernel-based web server, and used in the task scheduling loop to control which cpus a task can run on) is a step in the right direction, but exposes an application to the specific details of which physical cpu numbers it is running on, which is not quite abstract or virtual enough. In particular, the application focused logic to use certain logical cpus or memories assigned to it in various ways (run this task here, put that shared memory region there) is confused with the system administrators decision to assign a set of resources (cpu and memory) to a job.

This Design Note proposes to add a single additional kernel notion: the CpuMemSet.

What is a CpuMemSet?

A system administrator or a system service with root permissions creates CpuMemSets, and assigns jobs (and indirectly, the VirtualMemoryRegion's created by those jobs) to run in a specified CpuMemSet. By default, during kernel boot, CpuMemSet of id == 0 is created, to include the first WORDSIZE (32 or 64) cpu's and the first WORDSIZE mem's.

Each process belongs to one CpuMemSet, which controls on which cpus that process may be scheduled. By default, the init (pid == 1) process belongs to CpuMemSet == 0, and any other process belongs to the CpuMemSet its parent belonged to when forked. Each memory region belongs to one CpuMemSet, which controls from which memory banks that region may allocate additional memory. By default, a memory region belongs to whatever CpuMemSet its creating process belonged to.

Regardless of the physical names or numbers of the assigned cpu's and mem's in a CpuMemSet, the job or region constrained to that CpuMemSet sees that it has some number of cpu's, say N, numbered 0 .. N-1, and some number of mem's, say M, numbered 0 .. M-1. A job may control for itself on which of the cpu's in its CpuMemSet any given task of that job is scheduled, and the job may control on which of the M mem's in its CpuMemSet any newly created VirtualMemoryRegion may allocate memory.

The kernel provides information, via /proc, of the number of cpu's and mem's, and of the distance between them, so that sufficiently intelligent system administrators and services can assign "closely" placed cpu's and mem's (perhaps all on the same node or quad) to the same CpuMemSet, for optimal performance. But the kernel has no notion (for the purpose of this service) of topology, nodes or quads. Nor does the kernel task scheduler or memory allocation code pay any attention to this distance; it just reports that value to the user code.

Processors, Memory and Distance

We have two sets of physical resources, processors and memory banks, and two corresponding sets of virtual resources, execution of tasks and virtual memory regions. Both the physical resources should be virtualized to the process, so that a given process can attach a small integer handle to a given processor or memory bank, and then make further resource allocation requests in terms of those handles.

Processors are separately and parallel scheduled general purpose execution units. Memory banks are partition classes of physical general purpose system ram, such that any two distinct locations within the same bank are the same distance from all processors, and for any two separate banks, there is normally at least one processor such the two banks are at a different distance from that processor. The distance from a given processor to a given memory bank is a scalar approximation of that memory's latency and bandwidth, when accessed from that processor. The longer the latency and the lower the bandwidth, the higher the distance.

Not all the processing or memory elements visible on the system bus are general purpose. There may be I/O buffer memory, DMA engines, vector processors and frame buffers. We care about the distance from any processing element, whether a general purpose cpu or not, to any memory element, whether system RAM or not.

Kanoj, as part of the LinuxScalabilityEffort, has proposed a NxN distance vector between any two processors. The above replaces that with a distinct M-length distance vector for each processor, giving the distance from that processor to each of M Memory banks. In general, processors don't read and write other processors. Rather they read and write shared memory, and communicate with other processors via that shared memory.

CpuMemSets

On large NUMA systems, administrators will often want to control which subset of processors and memory is devoted to which major application. This can be done using "hard" partitions, where subsets of the system are booted using separate system images, and the partitions act as a cluster of distinct computers, rather than a SingleSystemImage computer. Doing so partially defeats the advantages of a large SMP or NUMA machine. At times it would be nice to be able to carve out more flexible, possibly overlapping, partitions of the systems cpus and memory, without rebooting, and allowing all processes to see a SingleSystemImage.

Roles of System Manager versus Application Architect

The role of a System Manager with regards to the systems processor and memory is to allocate portions of the system to various applications, usually with simple default policies, such as "spread things out evenly", and occassionally with more precision, controlling exactly which cpus and memories a particular application uses.

The role of an Application Architect in this regard is to specify for a given application the details of just which cpus and memory available are used to schedule which tasks, and to allocate which memory.

The System Manager is managing a particular physical computer system, preferring to remain relatively oblivious to the inards of applications, and the Application Architect is managing the details of task and memory usage within a single application, perferring to ignore the details of the particular system being used to execute the application.

The System Manager does not usually care whether the application puts two particular threads on the same cpu or different, and the Application Architect does not care whether that cpu is number 9 or number 99 in the system.

Roles Define a CpuMemSet

If the Manager has in mind to allow some processes to run "anywhere that there is room", they have in mind the first "global" partition. If they have in mind to restrict an app to just one node, they have in mind an instance of the second, "node" partition. Meanwhile the Application Architect doesn't care which node the app runs on, but might well care about the details of processor and memory allocation within its assigned node.

These separate roles, perspectives and needs lead to a concept of a CpuMemSet, and to the tools and interfaces required to assign cpu and memory to a CpuMemSet (by the System Manager), and to control the detailed use of the cpu and memory available within a partition (by the Application Architect). One works from outside an application, constructing a CpuMemSet out of available physical cpu and memory, and then allocating that Partition to an app. The other works from inside the app, controlling the more fine grane memory allocation and processor scheduling policies.

The Manager will need tools to construct CpuMemSets, assign, specifically or by a general policy, applications to them. The Architect will need system calls that either specifically or by general policy attach scheduable tasks to available cpus, and memory allocation requests to specific RAM.

Software Components to be Developed

The following software capabilities need to be developed to provide these capabilities.

Discover cpu and memory.
There must be a way for a system administrator or system service to discover, through the file system most likely (/proc, say) what physical processors and memory banks are known. The current /proc/cpuinfo and meminfo interfaces are perhaps too non-standard for this, and require parsing text presentations to discover the physical resources available. Perhaps we should provide a second parallel more API-friendly /proc structure to provide specifically the above information.
Virtual mapping of available processors and memory banks to application.
There must be a mapping of the N processors and M memory banks available to a process to uniform handles, such as the integers 0..N-1 and 0..M-1, independent of their physical names or numbers. That way, an application is isolated from changes, even dynamic changes during a single execution, in the numbering or naming of the particular processors or memory banks assigned to it. The processor and memory handles must be virtualized from the actual hardware, in a simple fashion. These mappings must be visible to both the system administrator and the application (at least to libraries linked with and operating on behalf of the application, in order to provide the virtual application distance vector implementation, in the "Processor to Memory Distance" item, below.)
Cpus allowed.
There must be a process attribute, inherited across fork and exec, for specifying on which cpus a task can be scheduled. This actually exists now, thanks to Ingo's work on Tux, as the "cpus_allowed" bit vector, that the scheduler uses to restrict scheduling choices, though below I will propose different implementation details.
Memories allowed.
There must be a virtual memory region attribute, applicable to all tasks sharing that region, for specifying from which memory banks that region can have more memory allocated. This also determines, when pages of a memory region are swapped back in, to which memory banks they may be swapped.
Initialization.
There must be proper initialization of the above per-process and per- region attributes, to allow all processors and all memory to be used.
Processor to memory distance.
In order for both the system administrator and the application to adapt to the "Non-Uniform" aspects of various ccNUMA architectures and topologies, using a mechanism that is topology neutral, there must be a system provided table of distances, from each processor to each bank of memory. The administrator requires these in terms of all the physical processor and memory bank names, likely visible in /proc. The application requires these in terms of the virtual processor and memory numbering visible to it. The virtual application distance information can be constructed on the fly by a user level library, using the information available from the system for both the distances between the physical resources, and for the mapping between the systems physical numbering and the applications virtual numbering.
A way for Application to change setting from within.
There must be a means for a process to change both the per-process task scheduling and the per-region memory allocation settings, to include changing which policy is in affect, tunable parameters of that policy, and in particular the specific vectors of "cpus_allowed" and some similar "mems_allowed". I propose that this is done using two new ulimit(2) options, one for processors and the other for memory banks, each providing for two values of that vector, one "max" value that can only be increased by root, and a second "current" value, respected by the process scheduler or memory allocator, and variable by the current process up to the limit of the "max" value. The ulimit setting for cpus_allowed would affect future scheduling decisions, and the ulimit setting for mems_allowed would affect the mems_allowed setting for future memory regions created by that process.
It isn't clear that there is a clean and simple way, given the current system calls available, to spell the command to change the mems_allowed for allocation by an existing memory region. However the implementation of PageMigration that automatically migrates pages to keep them within their current CpuMemSet might be sufficient here.
Various grand-daddy processes, such as init, inetd and batch schedulers, may provide configurable means, on certain platforms, to fine tune which processors and memory banks to allocate to which job or session.
A way for a system utility to change from without.
System administrators and services require a means for one process to affect the above settings for another process, in order to externally request migration of a process from one processor or memory bank to another. The permissions on changing the "current" values, within the constraints of the "max" setting, perhaps resemble the permissions required for one process to successfully signal (kill) another process. Presumably only root can change the "max" setting of another process.

Selected Details of Implementation

The following details sketch a simple implementation that provides the above capabilities.

The cpumemset struct:
```
        struct cpumemset { 
            int physcpu[WORDSIZE];      # map phys cpu number to virtual 
            int physmem[WORDSIZE];      # map phys mem number to virtual 
        }; 
    
```
A given cpumemset structure contains two arrays, one of them mapping up to WORDSIZE virtual cpu numbers to a corresponding physical cpu number, and the other doing the same for memory banks.
The cpumemsets array:
```
        struct cpumemset *cpumemsets[]; # kernel's collection of cpumemsets 
    
```
The kernel has a single global array of pointers to cpumemsets. Initially, at boot, the kernel constructs a single element of this array, cpumemsets[0], which contains the cpumemset for the up to the first 64 cpus and mems in the system.
For single-cpu systems, this is the only possible cpumemset (though nothing prevents a system administrator or service from adding additional elements cpumemsets[] that describe the same cpumemset.)
On larger multiprocessor systems, there might be 100's or 1000's of various cpumemsets[], describing various combinations of the available cpu's and mem's. Typically, though, I wouldn't expect that many cpumemsets[].
The index of a particular cpumemset{} in this cpumemsets[] array is a globally visible quantity, with the same significance to any process in the system.
Since it is not expected that this cpumemsets[] array will change that rapidly, after its initial construction, there is no apparent performance penalty to having it global, rather than per-node.
Per task cpumemset index:
```
        int cpumemset; 
    
```
Each task belongs to a cpumemset, and has in its task structure the index of that cpumemset in the cpumemsets[] global array.
Per VirtualMemoryRegion cpumemset index:
```
        int cpumemset; 
    
```
Each VirtualMemoryRegion belongs to a cpumemset, and has in its region structure the index of that cpumemset in the cpumemsets[] global array.
Per task cpus_allowed:
```
        unsigned long max_cpus_allowed; 
        unsigned long cur_cpus_allowed; 
    
```
Each task has max and cur cpus_allowed bit vectors. The task scheduler will only schedule a task on a given physical cpu if that cpu is listed in that tasks cpumemset physcpu[] array for some index and that same index is set in that tasks cur_cpus_allowed bit vector.
Per VirtualMemoryRegion mems_allowed:
```
        unsigned long max_mems_allowed; 
        unsigned long cur_mems_allowed; 
    
```
Each VirtualMemoryRegion has max and cur mems_allowed bit vectors. The memory allocator(s) will only allocate memory to a given region from a particular memory bank if that bank is listed in that regions cpumemset physmem[] array for some index, and that same index is set in that regions cur_mems_allowed bit vector.
Initialization and Inheritance:
The cpumemsets[] is initialized at boot as described above.
Also during boot, the init (pid==1) process is assigned this first cpumemsets[] element, cpumemset == 0. Each time a process is forked, the child inherits its cpumemset value from its parent.
Each time a new VirtualMemoryRegion is created, it inherits its cpumemset value from its creating process.
System Call Operations Required:
- Create a new empty cpumemset.
- Change the value (two WORDSIZE element arrays) of a cpumemset
- Retrieve the value (two WORDSIZE element arrays) of a cpumemset
- Delete an existing cpumemset (error if any task/region using)
- Change task (specified pid) to use a specified cpumemset
- List tasks (pids) using a specified cpumemset
- Get physical cpu to memory distances (via /proc?)
- Ulimit options to set max and cur cpus and mems allowed
Other areas we likely need to change:
- Additional C library supporting routines, syscall wrappers
- System and user header file additions
- System Administration utilities, with man pages
- Ulimit shell builtin extensions
- /proc displays of distance and cpumemset values
- Kernel and user debugger and PCP hooks