Hubertus Franke, Shailabh Nagar, Mike Kravetz, Rajan Ravindran
IBM Linux Technology Center
{frankeh,nagar,kravetz,rajancr}@us.ibm.com
Version 0.3, April 2001
The purpose of this implementation is to introduce CPU pooling and load balancing extensions to the proposed MultiQueue scheduler (MQ).
In the current MQ as well as in the vanilla Linux scheduler, the scheduler searches the entire set of processors at reschedule_idle() time (i.e. at process creation or wakeup time) to establish a potential processor to preempt. During schedule(), the scheduler determines whether any remote processor has a task waiting for execution in its local runqueue whose goodness value is higher by at least PROC_CHANGE_PENALTY than that of the best local schedulable task.
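As a rough sketch of that decision (not the literal MQ code), with c denoting the goodness of the best local candidate and remote_na_goodness() a purely illustrative helper:

        /* Sketch only: a task on a remote runqueue is worth pulling over
         * only if its non-affinity goodness beats the best local choice
         * by at least the cpu-change penalty. */
        if (remote_na_goodness(rcpu) > c + PROC_CHANGE_PENALTY) {
                /* consider migrating/preempting in favour of that task */
        }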
In a pool-based MQ the processors are divided into multiple processor
sets. For instance, an 8-way SMP can be divided into different configurations
such as 8x1, 4x2, or 2x4 (pools x poolsize). By introducing processor sets,
we can potentially eliminate traversing the entire set of processors
at scheduling time in order to place a process, or to preempt or steal
a process on/from another processor. However, we maintain a bitmap of
processors that are not idle to quickly identify whether there are idle
processors in the system.
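For illustration, using the first_cpu/last_cpu boundaries introduced below (and assuming, as the rq < rqlast loops later in this document suggest, that last_cpu is exclusive), a 4x2 split of an 8-way SMP yields:

        cpus 0-1:  first_cpu = 0, last_cpu = 2
        cpus 2-3:  first_cpu = 2, last_cpu = 4
        cpus 4-5:  first_cpu = 4, last_cpu = 6
        cpus 6-7:  first_cpu = 6, last_cpu = 8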
A welcome side effect is that processes might achieve a stronger affinity
within their local pool. A negative side effect is that the system
might become unbalanced. We therefore introduce the concept of load balancing,
wherein processes are periodically moved between runqueues in order to
equalize the lengths of the runqueues.
Pooling within the MultiQueue scheduler required the addition of pool-boundary
fields, most notably first_cpu and last_cpu, to the aligned_data
variables:
typedef union aligned_data {
        struct schedule_data {
                struct task_struct * curr;      /* current task on this CPU */
                cycles_t last_schedule;         /* time of last scheduling decision */
#ifdef CONFIG_SMP
                int curr_na_goodness;           /* non-affinity goodness value of current task */
                int first_cpu;                  /* first and last cpu in */
                int last_cpu;                   /*   the cpu pools */
                int resched_first_cpu;          /* first and last cpu in */
                int resched_last_cpu;           /*   the cpu pools */
                unsigned long pool_mask;        /* poolmask ids phys cpus */
#endif
        } schedule_data;
        char __pad [SMP_CACHE_BYTES];
} aligned_data_t;
extern aligned_data_t aligned_data[];
#define cpu_pool_first(cpu) aligned_data[(cpu)].schedule_data.first_cpu
#define cpu_pool_last(cpu) aligned_data[(cpu)].schedule_data.last_cpu
#define cpu_pool_resched_first(cpu) aligned_data[(cpu)].schedule_data.resched_first_cpu
#define cpu_pool_resched_last(cpu) aligned_data[(cpu)].schedule_data.resched_last_cpu
#define cpu_pool_mask(cpu) aligned_data[(cpu)].schedule_data.pool_mask
cpu_pool_first(cpu) and cpu_pool_last(cpu) give the pool boundaries of a cpu at any given instant. By default the system creates pools of size 1 (because at sched_init() time the variable smp_num_cpus is not yet set). The pool size can be set to smp_num_cpus by calling sched_init_pools(). Currently the sched_init_pools() call is only activated for the x86 architecture, in arch/i386/kernel/smpboot.c, before smp_commenced is raised. For other architectures, one needs to identify the appropriate place to call this function or live with the initial settings.
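As a rough illustration (not the actual patch code), sched_init_pools() can be thought of as carving the logical cpus into consecutive pools and recording the resulting boundaries; the global poolsize variable (defaulting to smp_num_cpus and settable via the boot parameter described next) and the treatment of the resched boundaries are assumptions:

        /* Sketch only: assumes a global "poolsize" and that last_cpu is
         * exclusive, as the rq < rqlast loops further down suggest. */
        void sched_init_pools(void)
        {
                int cpu, first, last;

                for (cpu = 0; cpu < smp_num_cpus; cpu++) {
                        first = (cpu / poolsize) * poolsize;
                        last  = first + poolsize;
                        if (last > smp_num_cpus)
                                last = smp_num_cpus;

                        cpu_pool_first(cpu) = first;
                        cpu_pool_last(cpu)  = last;

                        /* reschedule_idle() boundaries default to the full
                         * range; the load balancing code may narrow them. */
                        cpu_pool_resched_first(cpu) = 0;
                        cpu_pool_resched_last(cpu)  = smp_num_cpus;
                }
        }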
The poolsize can be set at boot time (subject to the non-x86 limitation above) by
specifying the new "poolsize=<num>" boot parameter.
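In a 2.4 kernel such a parameter is typically wired up with a __setup() handler; the sketch below shows one way this could look (the handler name and the global poolsize variable are illustrative, not necessarily what the patch uses):

        /* Sketch only: parse "poolsize=<num>" from the kernel command line. */
        static int __init poolsize_setup(char *str)
        {
                poolsize = simple_strtoul(str, NULL, 0);
                return 1;
        }
        __setup("poolsize=", poolsize_setup);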
Dynamically modifying the poolsize, and consequently the first_cpu
and last_cpu members, is integrated
into the load balancing code (see below).
In this implementation we limit the search in schedule() to the local pool only. We replace the loop that builds the stack list and traverses the entire processor set with one that is limited to the local pool; the scan still starts from the goodness c of the best local candidate (tmp_na_goodness = c). Similarly, in reschedule_idle() we look at the cpu set that spans the local pool:
        rqfirst = cpu_pool_first(this_cpu);
        rqlast  = cpu_pool_last(this_cpu);
        for (rq = rqfirst; rq < rqlast; rq++) {
                rcpu = cpu_logical_map(rq);
                /* .. build stack list .. */
        }
By default these boundaries are set to 0..smp_num_cpus, but they can be overwritten by the load balancing code below to look only at the range of the specified cpu pool.
Should the range be limited in reschedule_idle(), then there is the potential of leaving idle processors idle, which is in general a bad idea. We therefore introduced a specific idle hunt/search to quickly identify idle processors. We maintain a node_idle_bits bitmask, in which we set or clear the bit corresponding to a cpu in schedule(), just before switching to a new task. This is equivalent to the test (cpu_curr(cpu) == idle_task(cpu)).
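A minimal sketch of that bookkeeping in schedule(), just before switching to the chosen task; treating a set bit as "busy" and a clear bit as "idle" follows the ffz()-based hunt shown below, and the local names next and this_cpu are assumptions:

        /* Sketch only: keep node_idle_bits in sync with what this cpu is
         * about to run (clear bit = idle, set bit = busy). */
        if (next == idle_task(this_cpu))
                clear_bit(this_cpu, &node_idle_bits);   /* cpu goes idle */
        else
                set_bit(this_cpu, &node_idle_bits);     /* cpu stays busy */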
In reschedule_idle() we first try to identify an idle cpu within our local pool, utilizing cpu_pool_mask(cpu), and if none is found we look for one among the pool-remote cpus.
        local_idle_bits = node_idle_bits;
        if (local_idle_bits == ~0L)             /* no idle cpu anywhere */
                goto regular_hunt;

        /* check for local idle */
        :

idle_hunt:
        /* we know that we will never inspect tsk_cpu
         * and therefore don't have to deal with the double locking
         * issues.
         */
        while ((curbits = (localmask | local_idle_bits)) != ~0L) {
                cpu = ffz(curbits);             /* next unvisited idle cpu */
                if (!can_schedule(p, cpu)) {
                        set_bit(cpu, &localmask);
                        continue;
                }
                if (spin_trylock(&runqueue_lock(cpu))) {
                        /* preempt the current task */
                        :
                        :
                        spin_unlock(&runqueue_lock(cpu));
                }
                set_bit(cpu, &localmask);       /* mark this cpu as visited */
        }

        if (search_remote_idle) {
                /* widen the hunt beyond the local pool and retry once */
                localmask = ~cpu_pool_mask(this_cpu);
                search_remote_idle = 0;
                goto idle_hunt;
        }

regular_hunt:
Load balancing within the MQ scheduler is an attempt to balance
the load on all CPUs. For simplicity we assume that the load on a CPU's
runqueue is defined by its length. We implemented a loadable kernel module
that performs load balancing for a pool-based MQ scheduler. It also initializes
or reinitializes the pool-based MQ.
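With load defined as runqueue length, the core of a balancing pass reduces to comparing per-cpu queue lengths. The sketch below is illustrative only; nt_running(cpu) stands in for however the patch exposes the per-runqueue task count:

        /* Illustrative only: report how uneven a pool currently is. */
        static int pool_imbalance(int first_cpu, int last_cpu)
        {
                int rq, cpu, load, max = 0, min = INT_MAX;

                for (rq = first_cpu; rq < last_cpu; rq++) {
                        cpu = cpu_logical_map(rq);
                        load = nt_running(cpu);         /* runqueue length */
                        if (load > max)
                                max = load;
                        if (load < min)
                                min = load;
                }
                return max - min;       /* tasks are worth moving if > 1 */
        }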
Compile the module as follows:
cmd> gcc -I/usr/src/linux/include -DMODULE -D__KERNEL__ -DLINUX -DCONFIG_SMP -Wall -c loadbalance.c
The module requires two input parameters that need to be specified at loading time:
Removing the module reverts the pool-based MQ back to a standard MQ scheduler:
cmd> rmmod loadbalance.o
Load balancing is initiated every (refreshticks*10) msecs and performs the following actions:
Status information can be printed by cat-ing the proc file. The first line shows the poolsize and refreshticks settings. The following lines show [ logical cpu_number : cpu_load nt_running cpu_load/nt_running ]. Since in the currently active implementation the load_value per process in the run queue is "1", the cpu_load and nt_running columns are equivalent and the ratio is always one. However, the code is set up to allow experimenting with different load_values.
cmd> cat /proc/schedinfo
4 1 50
0: 4 4 1
1: 2 2 1
2: 3 3 1
3: 1 1 1
4: 2 2 1
5: 2 2 1
6: 2 2 1
7: 2 2 1
Setting the poolsize and refresh rate at runtime:
cmd> echo "2 1 5" > /proc/schedinfo