Hubertus Franke, Shailabh Nagar, Mike Kravetz, Rajan Ravindran
IBM Linux Technology Center
{frankeh,nagar,kravetz,rajancr}@us.ibm.com
Version 0.3, April 2001
The purpose of this implementation is to introduce CPU pooling and load balancing extensions to the proposed MultiQueue scheduler (MQ).
In the current MQ as well as in the vanilla Linux scheduler, the scheduler searches the entire set of processors at reschedule_idle() time (i.e. at process creation or wakeup time) to establish a potential processor to preempt. During schedule(), the scheduler determines whether any remote processor has a task waiting for execution in its local runqueue whose goodness value is higher by at least PROC_CHANGE_PENALTY than that of the best local schedulable task.
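As a rough sketch of that decision (not the literal MQ code), with c denoting the goodness of the best local candidate and remote_na_goodness() a purely illustrative helper:

        /* Sketch only: a task on a remote runqueue is worth pulling over
         * only if its non-affinity goodness beats the best local choice
         * by at least the cpu-change penalty. */
        if (remote_na_goodness(rcpu) > c + PROC_CHANGE_PENALTY) {
                /* consider migrating/preempting in favour of that task */
        }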
In a pool-based MQ the processors are divided into multiple processor
sets. For instance, an 8-way SMP can be divided into different configurations
such as 8x1, 4x2, or 2x4 (pools x poolsize). By introducing processor sets,
we can potentially eliminate traversing the entire set of processors
at scheduling time in order to place a process, or to preempt or steal
a process on/from another processor. However, we maintain a bitmap of
processors that are not idle to quickly identify whether there are idle
processors in the system.
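For illustration, using the first_cpu/last_cpu boundaries introduced below (and assuming, as the rq < rqlast loops later in this document suggest, that last_cpu is exclusive), a 4x2 split of an 8-way SMP yields:

        cpus 0-1:  first_cpu = 0, last_cpu = 2
        cpus 2-3:  first_cpu = 2, last_cpu = 4
        cpus 4-5:  first_cpu = 4, last_cpu = 6
        cpus 6-7:  first_cpu = 6, last_cpu = 8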
A welcome side effect is that processes might achieve a stronger affinity
within their local pool. A negative side effect is that the system
might become unbalanced. We therefore introduce the concept of load balancing,
wherein processes are periodically moved between runqueues in order to
equalize the lengths of the runqueues.
Pooling within the MultiQueue scheduler required the addition of pool-boundary
fields, most notably first_cpu and last_cpu, to the aligned_data
variables:
typedef union aligned_data {
        struct schedule_data {
                struct task_struct * curr;      /* current task on this CPU */
                cycles_t last_schedule;         /* time of last scheduling decision */
#ifdef CONFIG_SMP
                int curr_na_goodness;           /* non-affinity goodness value of current task */
                int first_cpu;                  /* first and last cpu in */
                int last_cpu;                   /*   the cpu pools */
                int resched_first_cpu;          /* first and last cpu in */
                int resched_last_cpu;           /*   the cpu pools */
                unsigned long pool_mask;        /* poolmask ids phys cpus */
#endif
        } schedule_data;
        char __pad [SMP_CACHE_BYTES];
} aligned_data_t;
extern aligned_data_t aligned_data[];
#define cpu_pool_first(cpu) aligned_data[(cpu)].schedule_data.first_cpu
#define cpu_pool_last(cpu) aligned_data[(cpu)].schedule_data.last_cpu
#define cpu_pool_resched_first(cpu) aligned_data[(cpu)].schedule_data.resched_first_cpu
#define cpu_pool_resched_last(cpu) aligned_data[(cpu)].schedule_data.resched_last_cpu
#define cpu_pool_mask(cpu) aligned_data[(cpu)].schedule_data.pool_mask
cpu_pool_first(cpu) and cpu_pool_last(cpu) give the pool boundaries of a cpu at any given instant. By default the system creates pools of size 1 (because at sched_init() time the variable smp_num_cpus is not yet set). The pool size can be set to smp_num_cpus by calling sched_init_pools(). Currently the sched_init_pools() call is only activated for the x86 architecture, in arch/i386/kernel/smpboot.c, before smp_commenced is raised. For other architectures, one needs to identify the appropriate place to call this function or live with the initial settings.
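As a rough illustration (not the actual patch code), sched_init_pools() can be thought of as carving the logical cpus into consecutive pools and recording the resulting boundaries; the global poolsize variable (defaulting to smp_num_cpus and settable via the boot parameter described next) and the treatment of the resched boundaries are assumptions:

        /* Sketch only: assumes a global "poolsize" and that last_cpu is
         * exclusive, as the rq < rqlast loops further down suggest. */
        void sched_init_pools(void)
        {
                int cpu, first, last;

                for (cpu = 0; cpu < smp_num_cpus; cpu++) {
                        first = (cpu / poolsize) * poolsize;
                        last  = first + poolsize;
                        if (last > smp_num_cpus)
                                last = smp_num_cpus;

                        cpu_pool_first(cpu) = first;
                        cpu_pool_last(cpu)  = last;

                        /* reschedule_idle() boundaries default to the full
                         * range; the load balancing code may narrow them. */
                        cpu_pool_resched_first(cpu) = 0;
                        cpu_pool_resched_last(cpu)  = smp_num_cpus;
                }
        }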
The poolsize can be set at boot time (subject to the non-x86 limitation above) by
specifying the new "poolsize=<num>" boot parameter.
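In a 2.4 kernel such a parameter is typically wired up with a __setup() handler; the sketch below shows one way this could look (the handler name and the global poolsize variable are illustrative, not necessarily what the patch uses):

        /* Sketch only: parse "poolsize=<num>" from the kernel command line. */
        static int __init poolsize_setup(char *str)
        {
                poolsize = simple_strtoul(str, NULL, 0);
                return 1;
        }
        __setup("poolsize=", poolsize_setup);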
Dynamically modifying the poolsize, and consequently the first_cpu
and last_cpu members, is integrated
into the load balancing code (see below).
In this implementation we limit the search in schedule() to the local pool only. We replace the loop that builds the stack list and traverses the entire processor set with one that is limited to the local pool; the scan still starts from the goodness c of the best local candidate (tmp_na_goodness = c). Similarly, in reschedule_idle() we look at the cpu set that spans the local pool:
        rqfirst = cpu_pool_first(this_cpu);
        rqlast  = cpu_pool_last(this_cpu);
        for (rq = rqfirst; rq < rqlast; rq++) {
                rcpu = cpu_logical_map(rq);
                /* .. build stack list .. */
        }
By default these boundaries are set to 0..smp_num_cpus, but they can be overwritten by the load balancing code below to look only at the range of the specified cpu pool.
Should the range be limited in reschedule_idle(), then there is the potential of leaving idle processors idle, which is in general a bad idea. We therefore introduced a specific idle hunt/search to quickly identify idle processors. We maintain a node_idle_bits bitmask, in which we set or clear the bit corresponding to a cpu in schedule(), just before switching to a new task. This is equivalent to the test (cpu_curr(cpu) == idle_task(cpu)).
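A minimal sketch of that bookkeeping in schedule(), just before switching to the chosen task; treating a set bit as "busy" and a clear bit as "idle" follows the ffz()-based hunt shown below, and the local names next and this_cpu are assumptions:

        /* Sketch only: keep node_idle_bits in sync with what this cpu is
         * about to run (clear bit = idle, set bit = busy). */
        if (next == idle_task(this_cpu))
                clear_bit(this_cpu, &node_idle_bits);   /* cpu goes idle */
        else
                set_bit(this_cpu, &node_idle_bits);     /* cpu stays busy */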
In reschedule_idle() we first try to identify an idle cpu within our local pool, utilizing cpu_pool_mask(cpu), and if none is found we look for one among the pool-remote cpus.
        local_idle_bits = node_idle_bits;
        if (local_idle_bits == ~0L)             /* no idle cpu anywhere */
                goto regular_hunt;

        /* check for local idle */
        :

idle_hunt:
        /* we know that we will never inspect tsk_cpu
         * and therefore don't have to deal with the double locking
         * issues.
         */
        while ((curbits = (localmask | local_idle_bits)) != ~0L) {
                cpu = ffz(curbits);             /* next unvisited idle cpu */
                if (!can_schedule(p, cpu)) {
                        set_bit(cpu, &localmask);
                        continue;
                }
                if (spin_trylock(&runqueue_lock(cpu))) {
                        /* preempt the current task */
                        :
                        :
                        spin_unlock(&runqueue_lock(cpu));
                }
                set_bit(cpu, &localmask);       /* mark this cpu as visited */
        }

        if (search_remote_idle) {
                /* widen the hunt beyond the local pool and retry once */
                localmask = ~cpu_pool_mask(this_cpu);
                search_remote_idle = 0;
                goto idle_hunt;
        }

regular_hunt:
Load balancing within the MQ scheduler is an attempt to balance
the load on all CPUs. For simplicity we assume that the load on a CPU's
runqueue is defined by its length. We implemented a loadable kernel module
that performs load balancing for a pool-based MQ scheduler. It also initializes
or reinitializes the pool-based MQ.
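With load defined as runqueue length, the core of a balancing pass reduces to comparing per-cpu queue lengths. The sketch below is illustrative only; nt_running(cpu) stands in for however the patch exposes the per-runqueue task count:

        /* Illustrative only: report how uneven a pool currently is. */
        static int pool_imbalance(int first_cpu, int last_cpu)
        {
                int rq, cpu, load, max = 0, min = INT_MAX;

                for (rq = first_cpu; rq < last_cpu; rq++) {
                        cpu = cpu_logical_map(rq);
                        load = nt_running(cpu);         /* runqueue length */
                        if (load > max)
                                max = load;
                        if (load < min)
                                min = load;
                }
                return max - min;       /* tasks are worth moving if > 1 */
        }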
Compile the module as follows:
cmd> gcc -I/usr/src/linux/include -DMODULE -D__KERNEL__ -DLINUX -DCONFIG_SMP -Wall -c loadbalance.c
The module requires two input parameters that need to be specified at loading time:
Removing the module reverts the pool-based MQ back to a standard MQ scheduler:
cmd> rmmod loadbalance.o
Load balancing is initiated every (refreshticks*10) msecs and performs the following actions:
Status information can be printed by cat-ing the proc file. The first line shows the poolsize and refreshticks settings. The following lines show [ logical cpu_number : cpu_load nt_running cpu_load/nt_running ]. Since in the currently active implementation the load_value per process in the run queue is "1", the cpu_load and nt_running columns are equivalent and the ratio is always one. However, the code is set up to allow experimenting with different load_values.
cmd> cat /proc/schedinfo
4 1 50
0: 4 4 1
1: 2 2 1
2: 3 3 1
3: 1 1 1
4: 2 2 1
5: 2 2 1
6: 2 2 1
7: 2 2 1
Setting the poolsize and refresh rate at runtime:
cmd> echo "2 1 5" > /proc/schedinfo