Process Scheduling and Memory Placement - Design Notes

Released to the public domain, by Paul Jackson, SGI, 2 July 2001.

Objective

Desired Properties

What notions do we have?

What do we want to do with these notions?

What notion are we missing?

What is a CpuMemSet?

Processors, Memory and Distance

CpuMemSets

Roles of System Manager versus Application Architect

Roles Define a CpuMemSet

Software Components to be Developed

The following software components need to be developed to provide these capabilities.

  1. Discover cpu and memory.

    There must be a way for a system administrator or system service to discover, most likely through the file system (/proc, say), what physical processors and memory banks are known. The current /proc/cpuinfo and meminfo interfaces are perhaps too non-standard for this, and require parsing text presentations to discover the physical resources available. Perhaps we should provide a second, parallel, more API-friendly /proc structure to provide specifically the above information.

  2. Virtual mapping of available processors and memory banks to application.

    There must be a mapping of the N processors and M memory banks available to a process to uniform handles, such as the integers 0..N-1 and 0..M-1, independent of their physical names or numbers. That way, an application is isolated from changes, even dynamic changes during a single execution, in the numbering or naming of the particular processors or memory banks assigned to it. The processor and memory handles must be virtualized from the actual hardware, in a simple fashion. These mappings must be visible to both the system administrator and the application (at least to libraries linked with and operating on behalf of the application, in order to provide the virtual application distance vector implementation described in the "Processor to memory distance" item below).
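
    A minimal user-level sketch of such a virtualized numbering (the array contents, physical cpu ids, and function name here are purely hypothetical, chosen for illustration):

```c
#include <assert.h>

#define NCPUS 4   /* N processors assigned to this application (example value) */

/* virt_to_phys_cpu[i] gives the physical cpu number behind virtual cpu i.
 * The physical ids 17, 18, 33, 34 are hypothetical. */
static int virt_to_phys_cpu[NCPUS] = { 17, 18, 33, 34 };

/* The application only ever speaks in terms of 0..N-1; if the system
 * renumbers its processors, only the table above changes. */
int phys_cpu_of(int vcpu)
{
    assert(vcpu >= 0 && vcpu < NCPUS);
    return virt_to_phys_cpu[vcpu];
}
```

    The same shape of table serves for the M memory banks, mapping virtual bank numbers 0..M-1 to physical bank names.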

  3. Cpus allowed.

    There must be a process attribute, inherited across fork and exec, for specifying on which cpus a task can be scheduled. This actually exists now, thanks to Ingo's work on Tux, as the "cpus_allowed" bit vector that the scheduler uses to restrict scheduling choices, though below I will propose different implementation details.

  4. Memories allowed.

    There must be a virtual memory region attribute, applicable to all tasks sharing that region, for specifying from which memory banks that region can have more memory allocated. This also determines, when pages of a memory region are swapped back in, to which memory banks they may be swapped.

  5. Initialization.

    There must be proper initialization of the above per-process and per-region attributes, to allow all processors and all memory to be used.

  6. Processor to memory distance.

    In order for both the system administrator and the application to adapt to the "Non-Uniform" aspects of various ccNUMA architectures and topologies, using a mechanism that is topology neutral, there must be a system-provided table of distances from each processor to each bank of memory. The administrator requires these in terms of the physical processor and memory bank names, likely visible in /proc. The application requires these in terms of the virtual processor and memory numbering visible to it. The virtual application distance information can be constructed on the fly by a user-level library, using the information available from the system for both the distances between the physical resources and for the mapping between the system's physical numbering and the application's virtual numbering.
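
    A sketch of how such a user-level library might construct the virtual distance table from the physical one plus the virtual-to-physical mappings (the array names, sizes, and function name here are illustrative, not a defined interface):

```c
#define NPHYS 4   /* physical cpus / memory banks known to the system (example) */
#define NVCPU 2   /* virtual cpus visible to this application (example) */
#define NVMEM 2   /* virtual memory banks visible to this application (example) */

/* Build the application's virtual cpu-to-memory distance table by
 * indexing the system's physical distance table through the
 * virtual-to-physical mapping arrays. */
void build_virtual_distances(const int phys_dist[NPHYS][NPHYS],
                             const int vcpu_to_phys[NVCPU],
                             const int vmem_to_phys[NVMEM],
                             int virt_dist[NVCPU][NVMEM])
{
    for (int c = 0; c < NVCPU; c++)
        for (int m = 0; m < NVMEM; m++)
            virt_dist[c][m] = phys_dist[vcpu_to_phys[c]][vmem_to_phys[m]];
}
```

    No kernel support beyond the two underlying pieces of information is needed; the virtual table is a pure derivation.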

  7. A way for the application to change settings from within.

    There must be a means for a process to change both the per-process task scheduling and the per-region memory allocation settings, including changing which policy is in effect, the tunable parameters of that policy, and in particular the specific vectors of "cpus_allowed" and some similar "mems_allowed". I propose that this be done using two new ulimit(2) options, one for processors and the other for memory banks, each providing two values of that vector: one "max" value that can only be increased by root, and a second "current" value, respected by the process scheduler or memory allocator, and variable by the current process up to the limit of the "max" value. The ulimit setting for cpus_allowed would affect future scheduling decisions, and the ulimit setting for mems_allowed would affect the mems_allowed setting for future memory regions created by that process.

    It isn't clear that there is a clean and simple way, given the current system calls available, to spell the command to change the mems_allowed for allocation by an existing memory region. However the implementation of PageMigration that automatically migrates pages to keep them within their current CpuMemSet might be sufficient here.

    Various grand-daddy processes, such as init, inetd and batch schedulers, may provide configurable means, on certain platforms, to fine tune which processors and memory banks to allocate to which job or session.
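
    The "max"/"current" rule proposed above can be modeled in a few lines of user-space code (the ulimit options themselves are only a proposal and do not exist; this just illustrates the intended semantics, and the function name is hypothetical):

```c
#include <stdbool.h>

/* A non-root process may set "current" to any subset of "max";
 * requests that reach outside "max" are refused.  Only root may
 * grow "max" itself (not modeled here). */
bool set_cur_cpus_allowed(unsigned long max_allowed,
                          unsigned long *cur_allowed,
                          unsigned long requested)
{
    if (requested & ~max_allowed)
        return false;            /* exceeds the "max" limit */
    *cur_allowed = requested;    /* respected by the scheduler */
    return true;
}
```

    The corresponding mems_allowed operation would be identical in shape, operating on the region's memory bank vector.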

  8. A way for a system utility to change settings from without.

    System administrators and services require a means for one process to affect the above settings for another process, in order to externally request migration of a process from one processor or memory bank to another. The permissions on changing the "current" values, within the constraints of the "max" setting, perhaps resemble the permissions required for one process to successfully signal (kill) another process. Presumably only root can change the "max" setting of another process.

Selected Details of Implementation

The following details sketch a simple implementation that provides the above capabilities.

  1. The cpumemset struct:
            struct cpumemset { 
                int physcpu[WORDSIZE];      /* map virtual cpu number to physical */ 
                int physmem[WORDSIZE];      /* map virtual mem number to physical */ 
            }; 
        
        
    A given cpumemset structure contains two arrays, one of them mapping up to WORDSIZE virtual cpu numbers to a corresponding physical cpu number, and the other doing the same for memory banks.

  2. The cpumemsets array:
            struct cpumemset *cpumemsets[]; /* kernel's collection of cpumemsets */ 
        
    The kernel has a single global array of pointers to cpumemsets. Initially, at boot, the kernel constructs a single element of this array, cpumemsets[0], which contains the cpumemset for up to the first 64 cpus and mems in the system.

    For single-cpu systems, this is the only possible cpumemset (though nothing prevents a system administrator or service from adding additional elements to cpumemsets[] that describe the same cpumemset).

    On larger multiprocessor systems, there might be hundreds or thousands of cpumemsets[] elements, describing various combinations of the available cpus and mems. Typically, though, I wouldn't expect that many.

    The index of a particular cpumemset{} in this cpumemsets[] array is a globally visible quantity, with the same significance to any process in the system.

    Since it is not expected that this cpumemsets[] array will change that rapidly, after its initial construction, there is no apparent performance penalty to having it global, rather than per-node.

  3. Per task cpumemset index:
            int cpumemset; 
        
    Each task belongs to a cpumemset, and has in its task structure the index of that cpumemset in the cpumemsets[] global array.

  4. Per VirtualMemoryRegion cpumemset index:
            int cpumemset; 
        
    Each VirtualMemoryRegion belongs to a cpumemset, and has in its region structure the index of that cpumemset in the cpumemsets[] global array.

  5. Per task cpus_allowed:
            unsigned long max_cpus_allowed; 
            unsigned long cur_cpus_allowed; 
        
    Each task has max and cur cpus_allowed bit vectors. The task scheduler will only schedule a task on a given physical cpu if that cpu is listed in that task's cpumemset physcpu[] array at some index and that same index is set in that task's cur_cpus_allowed bit vector.
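
    The eligibility test above can be sketched as follows, using the cpumemset struct from this document (the -1 sentinel for unused physcpu[] slots and the function name are assumptions of this example, not part of the design):

```c
#include <stdbool.h>

#define WORDSIZE 64

struct cpumemset {
    int physcpu[WORDSIZE];   /* map virtual cpu number to physical */
    int physmem[WORDSIZE];   /* map virtual mem number to physical */
};

/* A task may be scheduled on physical cpu `phys` only if some virtual
 * index i in its cpumemset maps to `phys` AND bit i is set in the
 * task's cur_cpus_allowed vector.  Unused physcpu[] slots are assumed
 * to hold -1, which is never a valid physical cpu number. */
bool task_may_run_on(const struct cpumemset *cms,
                     unsigned long cur_cpus_allowed,
                     int phys)
{
    for (int i = 0; i < WORDSIZE; i++)
        if (cms->physcpu[i] == phys && (cur_cpus_allowed & (1UL << i)))
            return true;
    return false;
}
```

    The memory allocator's check in the next item is symmetric, substituting physmem[] and cur_mems_allowed.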

  6. Per VirtualMemoryRegion mems_allowed:
            unsigned long max_mems_allowed; 
            unsigned long cur_mems_allowed; 
        
    Each VirtualMemoryRegion has max and cur mems_allowed bit vectors. The memory allocator(s) will only allocate memory to a given region from a particular memory bank if that bank is listed in that region's cpumemset physmem[] array at some index, and that same index is set in that region's cur_mems_allowed bit vector.

  7. Initialization and Inheritance:

    The cpumemsets[] is initialized at boot as described above.

    Also during boot, the init (pid==1) process is assigned this first cpumemsets[] element, cpumemset == 0. Each time a process is forked, the child inherits its cpumemset value from its parent.

    Each time a new VirtualMemoryRegion is created, it inherits its cpumemset value from its creating process.
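
    Both inheritance rules amount to copying the cpumemset index. A toy user-space model (the struct and function names here are illustrative, not kernel code):

```c
/* Minimal model of the inheritance rules: the cpumemset index is
 * copied from parent to child at fork, and from the creating process
 * to each new memory region. */
struct task_model   { int cpumemset; };
struct region_model { int cpumemset; };

void inherit_on_fork(const struct task_model *parent, struct task_model *child)
{
    child->cpumemset = parent->cpumemset;   /* inherited across fork */
}

void init_new_region(const struct task_model *creator, struct region_model *r)
{
    r->cpumemset = creator->cpumemset;      /* inherited from creator */
}
```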

  8. System Call Operations Required:

  9. Other areas we likely need to change: