NUMA Scheduler

Michael - The general difference between SMP and NUMA is that some memory is closer to some processors than other memory. Different NUMA architectures have different latency ratios: IBM NUMA-Q, which is old hardware, is roughly 20 to 1 remote to local, while SGI and other NUMA architectures are closer to 3 or 4 to 1. But all of them can benefit from a scheduler with NUMA awareness.

Martin - Noticed that some of Erich's code touches the SMP code paths, which may or may not be helpful; Martin still wants to test it. Since the NUMA scheduler code has not been accepted into the 2.5 kernel yet, Martin wants to come up with a solution that touches nothing else and is as small as possible, to increase the chance of it still being accepted into mainline 2.5.

Michael - Is there any way to piggyback hyperthreading support on top of the new mini NUMA scheduler? Martin thinks that would be better put into the original load-balancing code; it is about spreading tasks out across different CPUs, and you probably do not want the purely in-node rebalancing in the hyperthreading case.

The plan is to check for imbalance every n ticks and rebalance only if necessary. Question - do you first choose the least balanced node and then the least balanced processor within it? Michael - yes. (A rough sketch of this two-level pass appears at the end of these notes.) How do you deal with hyperthreaded processors on a NUMA box? That would mean multiple levels of balancing, which is not supported yet; it would be interesting to see how much value there is in it. Martin likes Ingo's idea of one runqueue per two CPUs, but apparently it never worked right.

Erich wants someone to look into multilevel node support; he originally had a matrix of distances between nodes. Michael was thinking of rebalancing the closer nodes more frequently than the further ones, which Martin said is similar to his concentric-circles rebalancing. Erich would count the number of rebalance attempts at the previous level, but Martin has a problem with that: why should cross-node rebalancing only happen once you are balanced within a node? Martin thinks you should rebalance across nodes whether or not the individual nodes are internally balanced. Erich does not want processes to end up too far away from their originating node. For example, if you start with 4 tasks on one CPU and balance them out to one task per CPU, there is nothing left to balance after that.

Andrew asked about a third level of balancing: if balancing across nodes is unsuccessful, try a third level, i.e. a supernode - say 4 nodes on one side of the machine and 4 nodes on the other, and if one side is much more stressed than the other, try the other side. Martin was not sure whether three-level architectures exist; Erich said they do.

Martin asked whether Erich has code that, instead of taking a static snapshot of the load, uses a running average of load statistics. Erich does have some code for this, but it is system-wide rather than per CPU; Martin wants something like it that works per CPU. (A small sketch of such a smoothed load figure follows.)
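The following is a minimal sketch of the kind of smoothed, per-CPU load figure being asked for here; the structure, field names and weights are illustrative assumptions, not Erich's actual code:

/*
 * Keep a decaying average of nr_running per CPU so that one momentary
 * spike does not look like a lasting imbalance.  The average is kept
 * scaled by 4 to retain some precision in integer arithmetic.
 */
struct cpu_load {
	unsigned long nr_running;	/* instantaneous runqueue length */
	unsigned long avg_scaled;	/* ~4 * smoothed nr_running */
};

/* called periodically, e.g. from the timer tick */
static void update_avg_load(struct cpu_load *c)
{
	/* new = 3/4 * old + 1/4 * current, in fixed point */
	c->avg_scaled = (3 * c->avg_scaled) / 4 + c->nr_running;
}

/* the value to use when comparing CPU loads across nodes */
static unsigned long smoothed_nr_running(const struct cpu_load *c)
{
	return c->avg_scaled / 4;
}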
Michael proposed a pull model in which, every once in a while, a node checks whether it needs to take work from some other node; the main concern is preventing processes from bouncing between nodes unnecessarily. Erich said they currently do two things to keep that from happening. First, loop over variables holding the number of running processes on each node, so that the search for the maximum picks a node other than one's own; only if there is a serious imbalance past a threshold (e.g. 25%) do they build a new mask of the CPUs to scan in find_busiest_queue, possibly only the remote node's mask. Second, use historical loads when looking for the busiest node: the comparison of CPU loads first has to pass the ~25% threshold and uses the previous nr_running, which is an average of old loads. Martin agreed this is a good way to smooth things out.

Martin asked: since you take the runqueue lock in order to steal a task anyway, how about taking the one with the smallest memory set size? The most important thing is to steal processes that are cache cold. Martin said we need to be really careful about modifying the standard scheduler code. Erich thinks the code that decides which tasks to steal should take the cache-coldest processes, not just ones that happen to be cache cold (a sketch of this selection is at the end of these notes). Martin asked to see the code Erich wrote for this again, but wants to start with really light code; if it gets into mainline, it can be improved after that. Erich agreed.

Erich and Michael think it is important to put node affinity back in, where a process remembers which node it started on so it can be moved back later if necessary. Everyone agreed this is important, but it will have to wait until the basic NUMA background rebalancer is in mainline. Martin thinks that if we have a set of routines that only run on NUMA, we can get away with doing more, as long as they do not touch the normal SMP scheduler code. Erich will work on the internode balancing code for this. In general, keep in mind that you want to steal less from a node that is far away. The critical thing is to get something in at this point, so Erich will drop the loop over nodes, since that is more complicated than needed for a first pass.

Question - what benchmark are they using to simulate load? Martin calls his schedbench, but Rusty called hackbench "schedbench" as well, so now there are two; Erich wants to call his numabench. It starts a few processes in parallel which are memory and bandwidth hungry, so you can feel it when a process is running away from its memory: the process's user time gets longer if it is moved to a different node than its memory. The benchmark reports what happened and gives some statistics after running it a few times. Erich ran it with Martin's mini scheduler and it showed good results, because processes cannot be moved off their original node.

Is there a recommended ratio of processes to processors for a valid run? Various configurations are being tried. With fewer tasks per node you find out whether the initial load balancer is working; with more tasks per node, processes might end up moving away from their nodes. It might be useful to do some runs with a non-integer number of processes per node, e.g. 1.5. Michael did runs with 24 tasks on 16 CPUs, and they showed that processes do bounce too much between nodes; he wants to experiment with making processes stickier to their nodes. Martin had some config ideas for setting the threshold based on the particular NUMA architecture.

Andrew is working on a hyperthreading topology patch. Right now it shows about a 4% improvement with kernbench on a lightly loaded system (the number of tasks is half the number of CPUs). He wants to run more than one benchmark and do more verification. Martin is going to send it out early next week, probably, and ask Linus to take it. Erich also wants to see Rusty's schedbench run with this as well.
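Below is a toy sketch of the two-level rebalance pass discussed in these notes: every N ticks pick the least balanced (most loaded) remote node, and only if it exceeds the current node's load by a ~25% threshold, pick the busiest CPU within that node (the equivalent of restricting find_busiest_queue to the remote node's CPU mask). All names, sizes and thresholds here are illustrative assumptions, not the actual 2.5 scheduler code:

#include <stdio.h>

#define NR_NODES	4
#define CPUS_PER_NODE	4
#define IMBALANCE_PCT	125	/* remote node must be ~25% busier */

static int nr_running[NR_NODES][CPUS_PER_NODE];	/* tasks per runqueue */

static int node_load(int node)
{
	int cpu, load = 0;

	for (cpu = 0; cpu < CPUS_PER_NODE; cpu++)
		load += nr_running[node][cpu];
	return load;
}

/* level 1: choose the least balanced node, if it passes the threshold */
static int find_busiest_node(int this_node)
{
	int node, busiest = -1;
	int this_load = node_load(this_node), max_load = this_load;

	for (node = 0; node < NR_NODES; node++) {
		if (node == this_node)
			continue;
		if (node_load(node) > max_load) {
			max_load = node_load(node);
			busiest = node;
		}
	}
	/* only steal across nodes on a serious imbalance */
	if (busiest >= 0 && max_load * 100 <= this_load * IMBALANCE_PCT)
		busiest = -1;
	return busiest;
}

/* level 2: choose the busiest CPU within that node */
static int find_busiest_cpu(int node)
{
	int cpu, busiest = 0;

	for (cpu = 1; cpu < CPUS_PER_NODE; cpu++)
		if (nr_running[node][cpu] > nr_running[node][busiest])
			busiest = cpu;
	return busiest;
}

int main(void)
{
	int node;

	nr_running[2][1] = 6;		/* fake a loaded remote runqueue */
	node = find_busiest_node(0);
	if (node >= 0)
		printf("steal from node %d, cpu %d\n",
		       node, find_busiest_cpu(node));
	return 0;
}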
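And a sketch of the "steal the cache-coldest task" idea: when pulling from a remote runqueue, prefer the task that has been off the CPU the longest, breaking ties toward the smallest memory footprint, so the migrated task has the least warm cache state to lose. The fields and list layout are assumptions for illustration, not the 2.5 task_struct:

struct task {
	struct task *next;	/* next runnable task on the queue */
	unsigned long last_run;	/* tick when it last ran; smaller = colder */
	unsigned long rss;	/* resident set size, in pages */
};

static struct task *pick_coldest(struct task *queue)
{
	struct task *t, *coldest = NULL;

	for (t = queue; t; t = t->next) {
		if (!coldest ||
		    t->last_run < coldest->last_run ||
		    (t->last_run == coldest->last_run && t->rss < coldest->rss))
			coldest = t;
	}
	return coldest;
}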