Implementation of a
Multi-Queue Scheduler
for
Linux

Mike Kravetz, Hubertus Franke
IBM Linux Technology Center
kravetz@us.ibm.com, frankeh@us.ibm.com

Version 0.2, April 2001


Introduction

As the title implies, this paper describes an implementation of a multi-queue scheduler for Linux. The current Linux scheduler is designed around a single run-queue which may become a bottleneck while running workloads with high task counts. In addition, systems with multiple CPUs may find the single run-queue lock a point of contention.

The purpose of this implementation is to explore what effect multiple run-queues have on the scalability of the Linux scheduler. The primary design goals are to preserve the behavior and semantics of the existing scheduler as much as possible, and to keep the code changes isolated to a small number of routines.

This implementation of a multi-queue scheduler maintains a separate run-queue for each CPU in the system. Major code changes are largely isolated to the schedule() and reschedule_idle() routines. The algorithmic changes within these routines are discussed in detail in later sections of this document.

Data Structures

This implementation defines a new per-CPU run-queue data structure as follows:
/*
 * runqueue_data
 *      One data item per CPU in the system.  Size should be a multiple
 * of cache line size, and array of items should start on a cache line
 * boundary.
 */
typedef union runqueue_data {
        struct rq_data {
                int nt_running;                 /* # of tasks on runqueue */
                int max_na_goodness;            /* maximum non-affinity */
                                                /* goodness value of    */
                                                /* 'schedulable' task   */
                                                /* on this runqueue     */
                struct task_struct * max_na_ptr; /* pointer to task which */
                                                 /* has max_na_goodness   */
                unsigned long max_na_cpus_allowed; /* copy of cpus_allowed */
                                                   /* field from task with */
                                                   /* max_na_goodness      */
                struct list_head runqueue;      /* list of tasks on runqueue */
                int running_non_idle;           /* flag to indicate this cpu */
                                                /* is running something      */
                                                /* besides the idle thread   */
                spinlock_t runqueue_lock;       /* lock for this runqueue */
        } rq_data;
        char __pad [SMP_CACHE_BYTES];
} runqueue_data_t;
extern runqueue_data_t runqueue_data[];
#define runqueue_lock(cpu) runqueue_data[(cpu)].rq_data.runqueue_lock
#define runqueue(cpu) runqueue_data[(cpu)].rq_data.runqueue
#define max_na_goodness(cpu) runqueue_data[(cpu)].rq_data.max_na_goodness
#define max_na_cpus_allowed(cpu) \
        runqueue_data[(cpu)].rq_data.max_na_cpus_allowed
#define max_na_ptr(cpu) runqueue_data[(cpu)].rq_data.max_na_ptr
#define nt_running(cpu) runqueue_data[(cpu)].rq_data.nt_running
#define running_non_idle(cpu) runqueue_data[(cpu)].rq_data.running_non_idle
In addition to the new run-queue data structure, new fields are added to the existing aligned_data data structure. The new definition of aligned_data is as follows:
/*
 * aligned_data
 *      CPU specific scheduling data.  One data item for each CPU
 * in the system.  Size should be a multiple of cache line size,
 * and array of items should start on a cache line boundary.
 */
typedef union aligned_data {
        struct schedule_data {
                struct task_struct * curr;      /* current task on this CPU */
                cycles_t last_schedule;         /* time of last scheduling */
                                                /* decision                */
#ifdef CONFIG_SMP
                int curr_na_goodness;           /* non-affinity goodness */
                                                /* value of current task */
#endif
        } schedule_data;
        char __pad [SMP_CACHE_BYTES];
} aligned_data_t;
extern aligned_data_t aligned_data[];
#define cpu_curr(cpu) aligned_data[(cpu)].schedule_data.curr
#ifdef CONFIG_SMP
#define curr_na_goodness(cpu) aligned_data[(cpu)].schedule_data.curr_na_goodness
#endif
#define last_schedule(cpu) aligned_data[(cpu)].schedule_data.last_schedule
The above declarations are made in the sched.h header file. The actual definitions (allocating storage) are made in the file sched.c as follows:
/*
 * We align per-CPU scheduling data on cacheline boundaries,
 * to prevent cacheline ping-pong.  Initialized in sched_init.
 */
aligned_data_t aligned_data [NR_CPUS] __cacheline_aligned;
runqueue_data_t runqueue_data [NR_CPUS+1] __cacheline_aligned;
The use of these new data structures and fields will be described in the following sections.

Associating a task with a run-queue

The task->processor field of a task structure determines which run-queue a task is associated with. In addition to the CPU specific run-queues, there is a real-time run-queue on which all real-time tasks (task->policy != SCHED_OTHER) are placed. A simple routine maps a task to its run-queue:
static inline int task_to_runqueue(struct task_struct *t)
{
        int rq;

        if ((t->policy & ~SCHED_YIELD) != SCHED_OTHER) {
                rq = REALTIME_RQ;
        } else {
                rq = t->processor;
        }
        return rq;
}
Associated with each run-queue is a separate lock used to synchronize operations on that run-queue. In general, the scheduler code only needs to obtain the lock associated with the run-queue on which it is operating; normally this is a CPU specific run-queue. Therefore, multiple instances of the scheduler can run in parallel on separate CPUs.

In some cases it may be desirable to hold more than one CPU specific run-queue lock at a time. In these cases there is always a primary run-queue lock which must be held. While holding the primary lock, it may be desirable (but not absolutely necessary) to acquire a secondary run-queue lock. In such cases the spin_trylock() routine is used to avoid deadlock. Descriptions of this locking methodology will be evident in subsequent sections of this document.
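
As a concrete illustration of this locking methodology, the fragment below sketches the general pattern under stated assumptions: p and other_cpu are hypothetical variables (a task of interest and some remote CPU), the primary lock is the run-queue returned by task_to_runqueue(), and a secondary run-queue is only examined if its lock can be acquired without spinning. This is a sketch of the pattern, not code from the patch.

        int rq = task_to_runqueue(p);

        spin_lock(&runqueue_lock(rq));                  /* primary lock, always taken */

        if (spin_trylock(&runqueue_lock(other_cpu))) {  /* secondary lock, best effort */
                /* ... safe to examine/modify the other run-queue here ... */
                spin_unlock(&runqueue_lock(other_cpu));
        } else {
                /* could not get the lock immediately; skip this run-queue */
        }

        spin_unlock(&runqueue_lock(rq));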

Global Scheduling Decisions

The current Linux scheduler makes global scheduling decisions. It can easily do this because there is a single run-queue and single run-queue lock. Hence, after obtaining the run-queue lock the scheduler can scan the list of all running tasks in the system to make a global scheduling decision. In most multi-queue schedulers, scheduling decisions are made locally (based on data of a specific run-queue). However, to maintain a sense of system wide balance and fairness, sophisticated load balancing algorithms are employed to balance the load across the multiple run-queues.

In this implementation of a multi-queue scheduler, we attempt to preserve the global scheduling decisions/behavior of the existing Linux scheduler. Therefore, this new scheduler must have more global knowledge than a traditional multi-queue scheduler. To facilitate this global knowledge, we keep track of something called non-affinity goodness (na_goodness) for specific tasks in the system. na_goodness is defined as simply the goodness value of a task without any consideration for CPU or memory map affinity. In the current goodness() routine, the goodness value of a SCHED_OTHER task is determined by:
  • the number of ticks remaining in the task's time slice (the counter field);
  • the task's static priority, derived from the nice value;
  • a small boost if the task's memory map (the mm field) matches that of the CPU making the decision;
  • a boost of PROC_CHANGE_PENALTY if the task last ran on the CPU making the decision (the processor field).

Note that only the processor and mm fields are based on a task's affinity. Therefore, when calculating a task's na_goodness value we use the same algorithm as the current goodness() routine and simply ignore the processor and mm fields.
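
To make this concrete, the following is a sketch of what an na_goodness() calculation for a SCHED_OTHER task might look like, following the structure of the 2.4 goodness() routine but omitting the processor and mm affinity boosts. It is an illustration under those assumptions, not the actual code from the patch.

static inline int na_goodness(struct task_struct * p)
{
        int weight = -1;

        if (p->policy & SCHED_YIELD)
                goto out;

        /* Remaining time slice; an expired slice yields a weight of 0 */
        weight = p->counter;
        if (!weight)
                goto out;

        /* Static priority contribution; no CPU or mm affinity boost */
        weight += 20 - p->nice;
out:
        return weight;
}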
 

The uses of na_goodness

As can be seen in the Data Structures section above, na_goodness values are maintained for specific tasks in the system: the highest priority 'schedulable' task on each CPU specific run-queue (the max_na_goodness and max_na_ptr fields of the run-queue data structure), and the task currently running on each CPU (the curr_na_goodness field of the aligned_data structure). Maintaining na_goodness values for these specific tasks allows this implementation of a multi-queue scheduler to make global scheduling decisions. When the schedule() routine is searching for the highest priority task in the system to run, it can easily check for high priority tasks on other run-queues by examining the max_na_goodness values of remote run-queues. When the reschedule_idle() routine is deciding what CPU a task should run on, it can use the na_goodness values of the currently running tasks to determine which running task, if any, should be preempted. The use of na_goodness values in the schedule() and reschedule_idle() routines is discussed in detail in the following sections.

schedule()

The main purpose of the schedule() routine is to determine what will be the next task to run on the current CPU. This routine is executable only within the context of a task (not within interrupt context). Upon entry to the routine, the current task is determined via the current macro, and the current CPU (this_cpu) is determined via the processor field of the current task (current->processor). The run-queue lock is then obtained. In the multi-queue scheduler, we only obtain the run-queue lock of the current CPU specific run-queue. This means that it is only safe to scan the entries of the current CPU specific run-queue. A new routine local_goodness() is used instead of goodness() when scanning CPU specific run-queues.  local_goodness() is essentially the same as goodness() except that there are no checks made for real time tasks or CPU affinity.  This is because we know there are no real time tasks on CPU specific run-queues (see Real Time Support section below), and all tasks on the run-queue have the same p->processor value. Also, while scanning the local run-queue, the code keeps track of the schedulable task with the next highest goodness value. The code looks something like the following:
        /*
         * Search CPU specific runqueue
         */
        list_for_each(tmp, &runqueue(this_cpu)) {
                p = list_entry(tmp, struct task_struct, run_list);
                if (local_can_schedule(p)) {
                        int weight = local_goodness(p, prev->active_mm);
                        if (weight > c) {
                                if (!next->has_cpu) {
                                        /*
                                         * prev_next must point to a
                                         * schedulable task.
                                         */
                                        prev_next_weight = c;
                                        prev_next = next;
                                }
                                c = weight;
                                next = p;
                        } else if (weight > prev_next_weight) {
                                prev_next_weight = weight;
                                prev_next = p;
                        }
                }
        }
Note that after the above search of the local CPU specific run-queue, the following variables are set: next points to the highest priority schedulable task on the local run-queue, c contains its goodness value, and prev_next (along with prev_next_weight) tracks the schedulable task with the next highest goodness value. As discussed in previous sections, each run-queue data structure contains a field which holds the maximum na_goodness value of the highest priority schedulable task on that specific run-queue. Remember that this na_goodness value does not take into account the CPU affinity of a task. However, the value c obtained via the search of our local CPU specific run-queue does contain a boost for CPU affinity. This boost is a constant value defined as PROC_CHANGE_PENALTY. The thought behind the PROC_CHANGE_PENALTY value is that a task on a remote run-queue must have a goodness value high enough to overcome this boost in order to be run on another (remote) CPU.

To make a global scheduling decision, this multi-queue scheduler will examine the max_na_goodness values of all run-queues in the system. This examination is performed without holding the locks of remote run-queues. Since multiple examinations of these max_na_goodness values may be needed, we create a list of these values on the local stack. Code to create this list is something like the following:

        /*
         * As calculated above, c does not contain a CPU affinity boost.
         * We must add this boost before comparing to tasks on other
         * runqueues.  Only add PROC_CHANGE_PENALTY if c is a positive
         * goodness value.
         */
        if (c > 0) {
                c += PROC_CHANGE_PENALTY;
        }

        /*
         * Copy max_na_goodness values from CPU specific runqueue
         * structures to the list on our stack.
         */
        rrq = -1;
        tmp_na_goodness = c;
        for (rq = 0; rq < smp_num_cpus; rq++) {
                rcpu = cpu_logical_map(rq);
                if (rcpu == this_cpu) {
                        stack_list[rcpu] = NA_GOODNESS_INIT;
                        continue;
                }
                if (!this_cpu_allowed(max_na_cpus_allowed(rcpu), this_cpu)) {
                        stack_list[rcpu] = NA_GOODNESS_INIT;
                        continue;
                }
                if (max_na_goodness(rcpu) <= c) {
                        stack_list[rcpu] = NA_GOODNESS_INIT;
                        continue;
                }

                stack_list[rcpu] = max_na_goodness(rcpu);
                if (stack_list[rcpu] > tmp_na_goodness) {
                        rrq = rcpu;
                        tmp_na_goodness = stack_list[rcpu];
                }
        }
NA_GOODNESS_INIT is defined to be an artificially low value. When building this local list, we discard any values less than or equal to c (the highest goodness value on our local CPU specific run-queue, with CPU affinity taken into account). In addition, we examine the cpus_allowed field and discard any tasks which are not allowed to run on this CPU. If no remote run-queue contains a max_na_goodness value which is higher than c, then we know that the task pointed to by next is the best choice for this CPU. However, if we do find a max_na_goodness value on a remote run-queue which is higher than c, then potentially there exists a task on another run-queue which is more desirable to run on this CPU. We choose the remote run-queue with the highest max_na_goodness value and do the following:
                /*
                 * First try to lock the remote runqueue and verify
                 * the max_na_goodness value.
                 */
                if (spin_trylock(&runqueue_lock(rrq))) {
                        if (max_na_goodness(rrq) > c &&
                            this_cpu_allowed(max_na_cpus_allowed(rrq),
                                                                this_cpu)) {
                                /*
                                 * Move a remote task to our run-queue
                                 */
                                if (!next->has_cpu) {
                                        prev_next = next;
                                }
                                next = max_na_ptr(rrq);
                                c = max_na_goodness(rrq);
                                del_from_runqueue(next);
                                next->processor = this_cpu;
                                add_to_runqueue(next);

                                /*
                                 * We have stolen a task from another
                                 * runqueue, quit looking.
                                 */
                                spin_unlock(&runqueue_lock(rrq));
                                break;
                        }
                        spin_unlock(&runqueue_lock(rrq));
                }
Note that we only attempt to acquire the remote run-queue lock via spin_trylock(). If we are unable to immediately obtain the lock, we ignore this remote run-queue in this scheduling decision. If we are able to obtain the remote run-queue's lock, we then need to verify that there is still a task with a sufficiently high max_na_goodness value that is allowed to run on the current CPU (since we can only be certain of obtaining accurate values while holding the lock). If these conditions are still met, the task pointed to by max_na_ptr (which has the high na_goodness value) is moved to the local run-queue. This task is then designated as the next task to run on this CPU.

If we are either unable to lock the remote run-queue, or after obtaining the lock we determine that the remote run-queue does not contain a task meeting the necessary requirements, then the remote run-queue is ignored in this scheduling decision. In this case, the list of max_na_goodness values (residing on our stack) is searched to find the remote run-queue with the next highest max_na_goodness value. The process of trying to obtain the remote run-queue lock and verifying remote max_na_goodness (and cpus_allowed) values is then repeated for the remote run-queue with the next highest value. This process is repeated until either a task is successfully moved from a remote run-queue to the local run-queue, or no remote run-queue remains with a max_na_goodness value greater than c (in which case the task found on the local run-queue is chosen).

After choosing the next task to run and setting c to the goodness value of the next task, the schedule() routine continues processing in the same manner as it does today.

Maintaining max_na_goodness and max_na_ptr

As shown above, each run-queue data structure contains the fields max_na_goodness and max_na_ptr. These fields can only be modified or (accurately) examined while holding the CPU specific run-queue lock. The fields are maintained in the following places:
  • schedule()  - Before dropping the CPU specific run-queue lock, these values are updated with the na_goodness value of the highest priority schedulable task. By definition the prev and next tasks are not schedulable because the has_cpu field is set in their task structure. While looking for the next task to run, schedule() keeps track of the next highest priority task on the local run-queue. If the next highest task on the local run-queue is the idle task, then max_na_goodness is set to NA_GOODNESS_INIT and max_na_ptr is set to NULL.
  • add_to_runqueue() - When adding a task to a CPU specific run-queue, a check is made to determine if the na_goodness value of the added task is greater than max_na_goodness for the run-queue.  If it is, then the fields are updated to reflect the values of the added task (see the sketch following this list).
  • del_from_runqueue() - When removing a task from a CPU specific run-queue, a check is made to determine if max_na_ptr points to the task being removed. If it does, then max_na_ptr is set to NULL and max_na_goodness is set to NA_GOODNESS_INIT.
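
The following is a sketch of this bookkeeping for add_to_runqueue() and del_from_runqueue(). The list manipulation mirrors the existing 2.4 routines, the macros are those from the Data Structures section, na_goodness() is the non-affinity goodness calculation sketched earlier, and the caller is assumed to hold the appropriate run-queue lock; the exact code in the patch may differ.

static inline void add_to_runqueue(struct task_struct * p)
{
        int rq = task_to_runqueue(p);
        int na = na_goodness(p);

        list_add(&p->run_list, &runqueue(rq));
        nt_running(rq)++;

        /* The new task may become the best non-affinity choice on this queue */
        if (na > max_na_goodness(rq)) {
                max_na_goodness(rq) = na;
                max_na_cpus_allowed(rq) = p->cpus_allowed;
                max_na_ptr(rq) = p;
        }
}

static inline void del_from_runqueue(struct task_struct * p)
{
        int rq = task_to_runqueue(p);

        /* If this was the tracked best task, invalidate the cached values */
        if (max_na_ptr(rq) == p) {
                max_na_goodness(rq) = NA_GOODNESS_INIT;
                max_na_ptr(rq) = NULL;
        }

        nt_running(rq)--;
        list_del(&p->run_list);
        p->run_list.next = NULL;
}
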
reschedule_idle()

The purpose of the reschedule_idle() routine is to determine on what CPU a newly runnable task should execute. The routine is passed a pointer to a task structure (called the target task in this discussion) and may be called from both task and interrupt context. The lock of the CPU specific run-queue associated with the target task must be held upon entry to (and for the duration of) this routine. Unlike the current version of reschedule_idle(), it is not necessary that the target task be on a run-queue. This is because the target task will be put on the CPU specific run-queue of the target (most desirable) CPU.

The first thing the reschedule_idle() routine does is check for a fast path of execution. This fast path can be taken if the CPU associated with the target task is idle. If the CPU is idle, then we ensure that the target task is on the CPU specific run-queue and inform the CPU that there is a task ready to be scheduled. This is exactly what the current reschedule_idle() code does today. Things get more complicated when this fast path is not taken and we must search for a target CPU on which to schedule the task.
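
As an illustration of this fast path, a sketch under assumptions is shown below: p is the target task, tsk_cpu is the CPU associated with p, this_cpu is the CPU executing reschedule_idle(), and the running_non_idle flag from the run-queue data structure is used to detect an idle CPU. The actual code in the patch may differ in detail.

        int tsk_cpu = p->processor;

        if (!running_non_idle(tsk_cpu)) {
                /* Target CPU is idle: queue the task and wake the CPU */
                if (!task_on_runqueue(p))
                        add_to_runqueue(p);
                cpu_curr(tsk_cpu)->need_resched = 1;
                if (tsk_cpu != this_cpu)
                        smp_send_reschedule(tsk_cpu);
                return;
        }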

Like the schedule() routine, the reschedule_idle() routine would like to make a global scheduling decision. In the current version of the Linux scheduler, global decisions can easily be made because there is a global run-queue lock. While holding this global run-queue lock, it is possible to examine the state of all CPUs in the system. While the run-queue lock is held, the list of currently running tasks will remain constant. We are afforded no such luxury in this multi-queue scheduler implementation. Rather, we only know for certain the state of the CPU associated with the target task (because the associated run-queue lock is held).

To make a global scheduling decision in reschedule_idle(), the state of all CPUs in the system must be examined. Such state information is contained in the list of CPU specific aligned_data data structures. Like the schedule() routine above, it is desirable to first examine these fields without holding CPU specific run-queue locks. Also, like the schedule() routine above we may wish to examine this data multiple times. Hence, we create a local list on our stack which contains this state information:

            /*
             * Create a list of current na_goodness values on our stack.
             * Only values less than the non-affinity goodness value of
             * p should be considered for preemption.
             */
            saved_na_goodness = na_goodness(p) - 1; /* preemption_goodness() > 1 */
            tmp_min_na_goodness = saved_na_goodness;
            curr_cycles = get_cycles();
            target_cpu = -1;
            for (i = 0; i < smp_num_cpus; i++) {
                    cpu = cpu_logical_map(i);
    
                    if (!can_schedule(p, cpu)) {
                            stack_list[cpu] = saved_na_goodness;
                            continue;
                    }
    
                    if (curr_na_goodness(cpu) == NA_GOODNESS_INIT) {
                            /*
                             * Indicates an idle task.  For idle tasks, determine
                             * the amount of time they have been idle.  Use the
                             * negative of this value in the list.  Hence, we
                             * first choose the CPU that has been idle the longest.
                             */
                            tmp_cycles = curr_cycles - last_schedule(cpu);
                            if (tmp_cycles > INT_MAX) {
                                    stack_list[cpu] = INT_MIN;
                            } else {
                                    stack_list[cpu] = (int)-tmp_cycles;
                            }
                    } else {
                            stack_list[cpu] = curr_na_goodness(cpu);
                            /*
                             * Add in PROC_CHANGE_PENALTY for remote CPUs
                             */
                            if (cpu != tsk_cpu) {
                                    stack_list[cpu] += PROC_CHANGE_PENALTY;
                            }
                    }
    
                    /*
                     * Look for the lowest value
                     */
                    if (stack_list[cpu] < tmp_min_na_goodness) {
                            target_cpu = cpu;
                            tmp_min_na_goodness = stack_list[cpu];
                    }
            }
The values contained in this local list correspond to how appropriate it is to run the target task on the corresponding CPU. The lower the value, the more appropriate it is to run the target task on that CPU. Note that for idle CPUs a corresponding negative value is in the list. In fact, the longer a CPU has been idle the more negative the number. CPUs running non-idle tasks have positive numbers corresponding to the na_goodness value of the task currently running on that CPU. When scanning the list of values we only need to take note of values less than the na_goodness value of the target task. This is because the target task can only possibly preempt tasks with a lower goodness value.

From the local list, the CPU associated with the lowest value is chosen and the following code is run.

                     if (target_cpu == tsk_cpu &&
                        preemption_goodness((tsk = cpu_curr(target_cpu)),
                                            p, target_cpu) > 1) {
                            /*
                             * If target_cpu is tsk_cpu, then no additional
                             * locking is required (we already have the CPU
                             * specific runqueue locked).  We also know that
                             * this CPU can not be idle, otherwise the fast
                             * path at the beginning of this routine would
                             * have been executed.  Therefore, simply send
                             * the IPI if required.
                             */
                            if (!task_on_runqueue(p)) {
                                    add_to_runqueue(p);
                            }
                            tsk = cpu_curr(target_cpu);
                            tsk->need_resched = 1;
                            if (target_cpu != this_cpu) {
                                    smp_send_reschedule(target_cpu);
                            }
                            return;
                    }
    
                    /*
                     * Try to lock runqueue and verify na_goodness value.
                     */
                    else if (spin_trylock(&runqueue_lock(target_cpu))) {
                            tsk = cpu_curr(target_cpu);
                            if ((tsk == idle_task(target_cpu)) ||
                                 (preemption_goodness(tsk, p, target_cpu) > 1)) {
                                    /*
                                     * Target CPU is idle, or it is running
                                     * a task with lower priority than p.
                                     * Therefore, move p to target runqueue.
                                     */
                                    if (task_on_runqueue(p)) {
                                            del_from_runqueue(p);
                                    }
                                    p->processor = target_cpu;
                                    add_to_runqueue(p);
    
                                    /*
                                     * Send an IPI to target CPU, unless the
                                     * CPU is idle and the need_resched flag
                                     * has already been set.
                                     */
                                    need_resched = tsk->need_resched;
                                    tsk->need_resched = 1;
                                    if ((target_cpu != this_cpu) &&
                                        ((tsk != idle_task(target_cpu)) ||
                                          !need_resched)){
                                            smp_send_reschedule(target_cpu);
                                    }
    
                                    spin_unlock(&runqueue_lock(target_cpu));
    
                                    return;
                            }
                            spin_unlock(&runqueue_lock(target_cpu));
                    }
Note that we only attempt to acquire the lock associated with a remote run-queue via spin_trylock(). If we are unable to immediately obtain the lock, we skip the remote CPU in this scheduling decision. After obtaining the lock, we check to ensure that the remote CPU is either idle or is running a task which the target task can preempt. If this is not the case, then the remote CPU is skipped in this scheduling decision. If the remote CPU looks like a good place for the target task, then we add the target task to the run-queue of the remote CPU and inform the remote CPU that it potentially needs to schedule the new task.

If we skip a remote CPU for any of the above reasons, then we go on to the entry in our local list with the next lowest value and repeat the above process. When all CPUs with values less than the na_goodness value of the target task have been exhausted, then the target task will remain (or be put) on its original run-queue. A check will then be made to determine if the target task can preempt the task currently running on its original run-queue.

Maintaining curr_na_goodness

As seen above, reschedule_idle() depends on the curr_na_goodness field of the aligned_data data structure to make scheduling decisions. This field is maintained primarily in schedule(): when a new task is chosen to run on a CPU, curr_na_goodness for that CPU is set to the na_goodness value of the chosen task, and it is set to NA_GOODNESS_INIT when the idle task is chosen (which is how reschedule_idle() recognizes an idle CPU).
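
A minimal sketch of that update, assuming it occurs near the end of schedule() once next has been chosen (the actual patch may update the field elsewhere as well):

        if (next == idle_task(this_cpu))
                curr_na_goodness(this_cpu) = NA_GOODNESS_INIT;
        else
                curr_na_goodness(this_cpu) = na_goodness(next);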

Miscellaneous Routines

Some miscellaneous routines were added to support the multi-queue scheduler implementation. Some of the more notable are task_to_runqueue(), na_goodness(), local_goodness(), local_can_schedule(), this_cpu_allowed(), and preemption_goodness(), all of which appear in the code fragments above.
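
As an illustration, plausible definitions of two of these helpers are sketched below. These are assumptions based on how the routines are used in the fragments above (a cpus_allowed bitmask test, and a 'not currently running' test for tasks already on the local run-queue); the definitions in the patch may differ.

/*
 * Can a task with this cpus_allowed bitmask run on cpu?
 */
#define this_cpu_allowed(cpus_allowed, cpu) \
        ((cpus_allowed) & (1UL << (cpu)))

/*
 * Tasks on the local CPU specific run-queue already have the correct
 * p->processor value, so the only check needed is that the task is
 * not currently running on some CPU.
 */
static inline int local_can_schedule(struct task_struct * p)
{
        return !p->has_cpu;
}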

Real-time Support

Scheduling of real-time tasks has some important ordering constraints. These constraints are required to support the Round Robin and FIFO scheduling policies. As you can imagine, it would be very difficult to maintain these ordering policies if real-time tasks were distributed among multiple run-queues. Therefore, a separate run-queue is dedicated to real-time tasks. All real-time tasks (p->policy != SCHED_OTHER) reside on this real-time run-queue.

The schedule() routine must take into account the real-time run-queue while searching for the next task to run. By definition, a real-time task has higher priority (goodness) than any other (SCHED_OTHER) task. Therefore, the schedule() routine must examine the real-time run-queue before examining its own CPU specific run-queue. If a schedulable task exists on the real-time run-queue, then that task must be chosen to run next. Ordering of real-time tasks within the real-time run-queue will remain as it is today. This preserves support for the required real-time scheduling policies.

Like the CPU specific run-queues, there is a lock associated with the real-time run-queue. Unlike the CPU specific run-queue locks, however, there are situations where a CPU specific run-queue lock is already held and the real-time run-queue lock must also be obtained. For these situations we define a lock hierarchy: a CPU specific lock must always be acquired before the real-time lock. This hierarchy prevents deadlock when obtaining multiple run-queue locks.
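
The fragment below illustrates this lock hierarchy; it is an assumed sketch of the ordering rule rather than code from the patch, with this_cpu being the CPU executing schedule():

        spin_lock_irq(&runqueue_lock(this_cpu));   /* CPU specific lock first */
        spin_lock(&runqueue_lock(REALTIME_RQ));    /* then the real-time lock */

        /* ... scan the real-time run-queue, then the local run-queue ... */

        spin_unlock(&runqueue_lock(REALTIME_RQ));
        spin_unlock_irq(&runqueue_lock(this_cpu));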

Limitations of the Implementation

The main purpose of this multi-queue scheduler implementation is to explore what impact the introduction of multiple run-queues will have on scalability. As a result, one goal was to maintain existing scheduler behavior/semantics. This goal is necessary to help isolate the cause of any performance/scalability changes caused by the implementation. If the implementation introduced multiple run-queues and significantly changed scheduler behavior, it may be difficult to say which had the greater impact: multiple queues or the change in behavior.

As described in the sections above, this implementation attempts to make global scheduling decisions in order to emulate the behavior of the current scheduler. As a result, each scheduling decision requires examination of multiple instances of CPU specific data, so as more CPUs are added to the system, more data must be examined for each scheduling decision. This CPU specific data is also contained in separate CPU specific cache lines, which means that on every scheduling decision each of these cache lines must be read by the CPU making the decision. As the number and rate of scheduling decisions increases, these cache lines will become a source of contention at the hardware level. For these reasons, we know that the scalability of this implementation will eventually break down.