NUMA Scheduler

Michael -

	The general difference between SMP and NUMA is that some memory is
closer to some processors than others. Different NUMA architectures have
different remote-to-local latency ratios: the IBM NUMA-Q, an older design,
is about 20 to 1, while SGI and other NUMA architectures are closer to
3 or 4 to 1. But they can all benefit from a scheduler with NUMA awareness.

Martin -

	Noticed that some of Erich's code touches SMP code, which may or
may not be helpful; Martin still wants to test it. Since the NUMA
scheduler code has not yet been accepted into the 2.5 kernel, Martin
wants to come up with a solution that won't touch anything else and is
as small as possible, to increase the chance of it still being accepted
into mainline 2.5.

Michael -

	Is there any way to piggyback hyperthreading support on top of
your new mini NUMA scheduler? Martin thinks that would be better done
in the original load-balancing code, spreading tasks out across the
different CPUs. You probably don't want to do the purely in-node
rebalancing in the hyperthreading case. The plan is to check for
imbalance every n ticks, then rebalance if necessary.

Question: do you first choose the least balanced node, then the least
balanced processor within it? Michael: yes. How do you deal with
hyperthreaded processors on a NUMA box? You would then have multiple
levels of balancing, which is not yet supported; it would be
interesting to see how much value there is in it.
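The two-level pick described above (least balanced node first, then the
least balanced CPU inside it) can be sketched roughly as follows. This is
only an illustration, not the actual patch: the array layout, sizes, and
function names are all assumptions.

```c
#define NR_NODES 4
#define CPUS_PER_NODE 4

/* nr_running per CPU, indexed [node][cpu-within-node] -- illustrative layout */
int node_load(int load[NR_NODES][CPUS_PER_NODE], int node)
{
    int cpu, sum = 0;
    for (cpu = 0; cpu < CPUS_PER_NODE; cpu++)
        sum += load[node][cpu];
    return sum;
}

/* First choose the most loaded node, then the most loaded CPU in it. */
void find_busiest(int load[NR_NODES][CPUS_PER_NODE],
                  int *busiest_node, int *busiest_cpu)
{
    int node, cpu;

    *busiest_node = 0;
    for (node = 1; node < NR_NODES; node++)
        if (node_load(load, node) > node_load(load, *busiest_node))
            *busiest_node = node;

    *busiest_cpu = 0;
    for (cpu = 1; cpu < CPUS_PER_NODE; cpu++)
        if (load[*busiest_node][cpu] > load[*busiest_node][*busiest_cpu])
            *busiest_cpu = cpu;
}
```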

Martin likes Ingo's idea of one run queue per two CPUs, but apparently
it never worked right.

Erich wants someone to look into multilevel node support. He originally
had a matrix of distances between nodes. Michael was thinking of doing
something where the closer nodes are rebalanced more frequently than
the more distant ones.

Martin said this is similar to his concentric-circles rebalancing.
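One way to sketch the "closer nodes rebalance more often" idea is to
derive the rebalance interval directly from a node distance matrix. The
matrix values and the tick units below are made-up illustrations, not
numbers from any real machine.

```c
#define NR_NODES 4

/* Hypothetical node distance matrix (local = 10, in the style of the
 * ACPI SLIT table); larger values mean more remote. */
static const int node_distance[NR_NODES][NR_NODES] = {
    { 10, 20, 40, 40 },
    { 20, 10, 40, 40 },
    { 40, 40, 10, 20 },
    { 40, 40, 20, 10 },
};

/* Rebalance interval in ticks grows with distance: a node twice as far
 * away gets examined half as often. */
int rebalance_interval(int from, int to)
{
    return node_distance[from][to];   /* e.g. 1 tick per distance unit */
}

/* Should node `from` consider stealing from node `to` on this tick? */
int should_rebalance(int tick, int from, int to)
{
    return (tick % rebalance_interval(from, to)) == 0;
}
```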

Erich counts the number of attempts to rebalance at the previous level.
But Martin has a problem with this: why should rebalancing across nodes
happen only when you are already balanced within a node? You want to
rebalance across nodes even if the individual nodes are internally
balanced, so Martin thinks cross-node rebalancing should run whether or
not the internal nodes are balanced.

Erich doesn't want to move processes too far away from their
originating node. Say you have 4 tasks on one CPU: balancing puts one
task on each CPU, and after that there is nothing left to balance.

Andrew asked about a third level of balancing: if across-node balancing
is unsuccessful, try a third level, i.e. super-nodes. For example, with
4 nodes on one side and 4 nodes on the other, if one side is much more
stressed than the other, steal from the other side at this third level.
Martin wasn't sure whether 3-level architectures are out there; Erich
said there are.

Martin asked whether Erich had code that, instead of taking a static
snapshot, uses a running average of the load statistics. He does have
some code for this, but it is system-wide rather than per-CPU. Martin
wants something like it that works per CPU.
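A per-CPU running average could be kept with a simple fixed-point
exponential decay, something like the sketch below. The decay weight
(7/8) and scale factor are arbitrary assumptions for illustration, not
anything from Erich's code.

```c
#define FIXED 1024   /* fixed-point scale, an arbitrary choice */

struct cpu_stat {
    int scaled_avg;   /* smoothed nr_running, scaled by FIXED */
};

/* new avg = 7/8 * old avg + 1/8 * current nr_running (fixed point),
 * so spikes are damped instead of driving rebalance decisions directly */
void update_cpu_load(struct cpu_stat *st, int nr_running)
{
    st->scaled_avg = (7 * st->scaled_avg + nr_running * FIXED) / 8;
}

int cpu_load(struct cpu_stat *st)
{
    return st->scaled_avg / FIXED;
}
```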

Michael proposed a pull method, where every once in a while a node
checks whether it needs to take work from some other node.

We want to prevent processes bouncing between nodes unnecessarily.
Erich said they currently do two things to keep that from happening.
First, loop over variables that hold the number of running processes on
each node, so that when searching for the maximum it will be a node
other than our own; if there is a serious imbalance beyond a threshold
(e.g. 25%), build a new mask of CPUs to scan in find_busiest_queue,
perhaps containing only the remote node. Second, use historical load
when going through find_busiest_node: the threshold must be passed when
comparing CPU loads using the previous nr_running, which is an average
of the old load. Martin agreed this is a good way to smooth things out.
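Putting those two pieces together, a rough sketch of the threshold test
plus the restricted scan mask might look like this. The function names
echo the kernel's find_busiest_* naming, but the bodies, the 25%
comparison, and the mask encoding are all illustrative assumptions.

```c
#define NR_NODES 4

/* Return the busiest remote node, or -1 if no node exceeds this node's
 * smoothed load by more than 25% (compare 4*remote > 5*local). */
int find_busiest_node(int avg_nr_running[NR_NODES], int this_node)
{
    int node, busiest = -1;
    int max_load = avg_nr_running[this_node];

    for (node = 0; node < NR_NODES; node++) {
        if (node == this_node)
            continue;
        /* serious imbalance: remote load > 125% of the best so far */
        if (4 * avg_nr_running[node] > 5 * max_load) {
            busiest = node;
            max_load = avg_nr_running[node];
        }
    }
    return busiest;
}

/* If a busiest node was found, restrict the CPU scan in
 * find_busiest_queue to that node's CPUs only. */
unsigned long remote_cpu_mask(int node, int cpus_per_node)
{
    return ((1UL << cpus_per_node) - 1) << (node * cpus_per_node);
}
```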

Martin asked: given that you take the runqueue lock in order to steal a
task, how about taking the one with the smallest memory footprint? Most
importantly, we want to steal processes that are cache-cold.

Martin said we need to be really careful about modifying the standard
scheduler code. Erich thinks the code that decides which tasks to steal
should take the cache-coldest processes, not just ones that are
cache-cold. Martin asked to see the code Erich wrote for this again,
but wants to start with really light code; if it gets into mainline, it
can be made better after that. Erich agreed.
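The "cache-coldest, not just cache-cold" selection could be sketched as
picking the candidate that ran longest ago. The struct, field names, and
timestamp units below are assumptions for illustration, not Erich's
actual code.

```c
struct task {
    unsigned long last_run;   /* time the task last ran on a CPU (assumed) */
};

/* Among steal candidates, pick the task that ran longest ago: its cache
 * footprint on the source CPU is most likely gone already, so migrating
 * it costs the least. */
int pick_cache_coldest(struct task *tasks, int n, unsigned long now)
{
    int i, coldest = 0;
    unsigned long best_age = now - tasks[0].last_run;

    for (i = 1; i < n; i++) {
        unsigned long age = now - tasks[i].last_run;
        if (age > best_age) {
            best_age = age;
            coldest = i;
        }
    }
    return coldest;
}
```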

Erich and Michael think it is important to put node affinity back in,
where a process remembers where it started so it can be moved back
later if necessary. They all agreed this is important, but it will have
to wait until the basic NUMA background rebalancer is in mainline.

Martin thinks that if we have a set of routines that only run on NUMA,
we can get away with doing more, as long as they don't touch the normal
SMP scheduler code.

erich will work on the internode balance code for this.

We should generally keep in mind that we want to steal less if the node
is far away.

The critical thing at this point is to get something in.

Erich will throw out the loop over nodes, since that is too complicated
for this first pass.

Question: what benchmark are they using to simulate load? Martin calls
it schedbench, but Rusty called hackbench "schedbench", so now we have
two. Erich wants to call it numabench.

It starts a few processes in parallel which are memory- and
bandwidth-hungry; this way you can see when a process is running with
the wrong memory, since the process's user time will be longer if it is
moved to a different node than its memory. The benchmark reports what
happened and gives you some statistics after running it a few times.
Erich ran it with Martin's mini scheduler, which showed good results
because processes cannot be moved off their original node.

Is there a recommended ratio of processes to processors to get a valid
run? We are currently trying various configurations: with fewer tasks
per node you find out whether the initial load balancer is working,
while with more tasks per node processes might end up moving away from
their nodes.

It might be useful to do some runs with non-even numbers of processes
per node, e.g. 1.5 per CPU. Michael did runs with 24 tasks on 16 CPUs,
which showed that processes do bounce too much between nodes. He wants
to experiment with making processes stickier to their nodes. Martin had
some config ideas for setting the threshold based on the different NUMA
architectures.

Andrew is working on a hyperthreading topology patch. Right now it
shows about a 4% improvement with kernbench on a lightly loaded system
(half as many tasks as CPUs). He wants to run more than one benchmark
and do more verification.


Martin is going to send it out early next week, probably asking Linus
to take it.

Erich also wants to see Rusty's schedbench run with this as well.