Over the last couple of weeks, as a result of feedback from Paul McKenney and some folks internal to SGI, I have made the following changes to my CpuMemSets Design proposal. Ideally, I would reissue the Design Notes now, but I am on a six-week sabbatical starting a few hours ago (well, almost started ...). Others at SGI are continuing with the further design and implementation of this proposal.

1) In addition to cpu-to-memory distances, we also require cpu-to-cpu distances. The cpu-to-cpu distance is a measure of how costly it would be, due to caching effects, to move a task currently executing on one cpu (where it has considerable cache presence) to another cpu. These distances reflect the impact of the system caches - two processors sharing a major cache are closer. The scheduler should be more reluctant to reschedule a task to a cpu further away, and two tasks communicating via shared memory will want to stay on cpus that are close to each other, in addition to being close to the shared memory.

2) The constraint of 64 (or 32) cpus in any given cpumemset is unacceptable. Experience with Irix on SGI's large SN-Mips systems includes working with large Fortran applications that spawn one thread per cpu, passing messages between the threads. These applications might need to run on hundreds of cpus. Since such a large Fortran numeric application should not be running as root, this means that we require non-privileged applications to be able to manipulate arbitrarily sized cpumemsets. The related ability of a privileged root system service to constrain a non-privileged application to some (albeit possibly large) subset of the system cpus remains important, however. This means that the max and cur settings, as available from the get/setrlimit() system calls, are still an important design notion, but that we cannot use those specific calls, because that interface is constrained to passing a value of but one word in length. Hence either some new system call must be added, or else some more generic system call, such as perhaps prctl(), will have to be extended. In either case, the system call interfaces for both the privileged and non-privileged calls will have to extend to hundreds or thousands of cpus. The original design goal of having at least the non-privileged system calls use a clean interface (an existing system call, but not one of the "Swiss Army knife" calls such as prctl) has _not_ been met. Oh well.

3) The original Design Notes had excluded special purpose processors and memory (such as DMA engines or frame buffers) from consideration. Some feedback had suggested that it was also important to support reporting relative distances to such special purpose ports. But other feedback recommended that I/O needs and distances were so special purpose that it was unlikely that the general CpuMemSet mechanisms would be of much use. After flip-flopping on this, I've decided to continue (for now, at least) to exclude special purpose nodes from this design. I am not opposed to adding awareness of special purpose nodes to this design; but I haven't seen a proposal that seems worthwhile yet. More likely, I suspect that a more complex hardware graph facility, combined with specialized ad hoc I/O optimizations, will succeed here.

4) The inner loop of the current Linux scheduler is not invoked by a ready-to-run task looking for a suitable cpu on which to run, but rather by a freed-up cpu looking for the best task to run on itself.
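Roughly, the shape of that loop is the following. This is my own stand-alone paraphrase, not the actual kernel source; "struct task" here is just a stand-in for task_struct, with only the fields that matter for this discussion:

    #include <stdint.h>

    /* Stand-in for task_struct; only the fields relevant here are shown. */
    struct task {
            struct task *next;              /* runqueue link                     */
            int          weight;            /* stand-in for the goodness() value */
            uint64_t     cpus_allowed;      /* one bit per physical cpu          */
    };

    /*
     * A freed-up cpu walks the runqueue looking for the best task that it
     * is allowed to run.  The per-task eligibility test is a single
     * mask-and-test against this cpu's bit in cpus_allowed.
     */
    struct task *pick_task(struct task *runqueue, int this_cpu)
    {
            struct task *p, *best = 0;

            for (p = runqueue; p != 0; p = p->next) {
                    if (!(p->cpus_allowed & (1ULL << this_cpu)))
                            continue;       /* task not allowed on this cpu */
                    if (best == 0 || p->weight > best->weight)
                            best = p;
            }
            return best;
    }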
This means that the kernel data structure suggested in my original Design, with a cur_cpus_allowed bit vector indexed by logical cpu number and an array used to map the logical cpu number (in that particular task's view) to the physical cpu number (in the system view), was not efficient. A key part of the scheduler inner loop, which currently is a couple of bit masks and tests, would have to become a code loop over the logical-to-physical array. This would be an unacceptable hit on scheduler performance.

Two alternatives have been proposed to address this.

a) The logical-to-physical map could be stored inverted in the kernel. Instead of storing, for each cpumemset, the list of cpus it contains, store, for each cpu, the cpumemsets[] index of each cpumemset containing that cpu. The impact on the scheduler inner loop code would be the addition of another bit mask, test and conditional jump.

b) Move the logical-to-physical conversion out of the inner scheduler loop. Instead of changing the cpus_allowed bit vector to be indexed by the task-relative logical cpu id, leave cpus_allowed _exactly_ as it is now, indexed by physical cpu id. Whenever calls are made to manipulate (create, destroy, alter or reattach) cpumemsets, run the necessary conversion loop from logical to physical at that time, updating the cpus_allowed bit vector. This gets the conversion loop out of the critical code path. (A rough sketch of this conversion is appended after my signature.) At some point, when someone is ready to push this code past 64 cpus, the current implementation of cpus_allowed as a single-word bit vector will have to change, but that had to happen anyway.

I have chosen (b), because it requires _no_ scheduler change, and because it is conceptually slightly simpler to keep the kernel cpumemsets[] data structure in the same form as it is manipulated by the system calls, rather than inverted.

The feature of (b) above, that it makes _no_ change to the current inner scheduler loop (the version of the scheduler with Ingo's cpus_allowed bit vector), is important. I hope that we are able to merge this CpuMemSets design with the MultiQueue scheduler work coming from our friends at IBM, and the easiest patch to merge is the one that changes nothing.

I won't rest till it's the best ...

                        Manager, Linux Scalability
                        Paul Jackson  1.650.933.1373
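For anyone who wants to see what the conversion in (b) amounts to, here is a minimal sketch, run once at cpumemset manipulation time rather than in the scheduler. The structure and function names below are invented for illustration only - they are not the actual kernel interface of this proposal - and the single 64-bit word of course still carries the 64 cpu limit noted above.

    #include <stdint.h>

    /* Illustrative-only view of a cpumemset: its logical-to-physical cpu map. */
    struct cpumemset {
            int  ncpus;     /* number of cpus in this set              */
            int *cpus;      /* cpus[logical cpu id] == physical cpu id */
    };

    /*
     * Approach (b): when a cpumemset is created, destroyed, altered or
     * reattached, run the logical-to-physical loop here, once, and rebuild
     * the physical cpus_allowed bit vector.  The scheduler inner loop keeps
     * using cpus_allowed exactly as it does today.
     */
    uint64_t cpumemset_to_cpus_allowed(const struct cpumemset *cms)
    {
            uint64_t allowed = 0;
            int i;

            for (i = 0; i < cms->ncpus; i++)
                    allowed |= 1ULL << cms->cpus[i];  /* set this physical cpu's bit */
            return allowed;
    }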