Over the last couple of weeks, as a result of feedback from Paul McKenney and some folks internal to SGI, I have made the following changes to my CpuMemSets Design proposal. Ideally, I would reissue the Design Notes now, but I am on a six-week sabbatical starting a few hours ago (well, almost started ...). Others at SGI are continuing with the further design and implementation of this proposal.

1) In addition to cpu-to-memory distances, we also require cpu-to-cpu distances. The cpu-to-cpu distance is a measure of how costly it would be, due to caching effects, to move a task currently executing on one cpu (where it has considerable cache presence) to another cpu. These distances reflect the impact of the system caches - two processors sharing a major cache are closer. The scheduler should be more reluctant to reschedule a task to a cpu further away, and two tasks communicating via shared memory will want to stay on cpus that are close to each other, in addition to being close to the shared memory.

2) The constraint of 64 (or 32) cpus in any given cpumemset is unacceptable. Experience with Irix on SGI's large SN-Mips systems includes working with large Fortran applications that spawn one thread per cpu, passing messages between the threads. These applications might need to run on hundreds of cpus. Since such a large Fortran numeric application should not be running as root, this means that we require non-privileged applications to be able to manipulate arbitrarily sized cpumemsets. The related ability of a privileged root system service to constrain a non-privileged application to some (albeit possibly large) subset of the system cpus remains important, however. This means that the max and cur settings, as available from the get/setrlimit() system calls, are still an important design notion, but that we cannot use those specific calls, because that interface is constrained to passing a value of but one word in length. Hence either some new system call must be added, or else some more generic system call, such as perhaps prctl(), will have to be extended. In either case, the system call interfaces for both the privileged and non-privileged calls will have to extend to hundreds or thousands of cpus. The original design goal of having at least the non-privileged system calls use a clean interface (an existing system call, but not one of the "Swiss Army knife" calls such as prctl) has _not_ been met. Oh well.

3) The original Design Notes had excluded special purpose processors and memory (such as DMA engines or frame buffers) from consideration. Some feedback had suggested that it was also important to support reporting relative distances to such special purpose ports. But other feedback recommended that I/O needs and distances were so special purpose that it was unlikely that the general CpuMemSet mechanisms would be of much use. After flip-flopping on this, I've decided to continue (for now, at least) to exclude special purpose nodes from this design. I am not opposed to adding awareness of special purpose nodes to this design; but I haven't seen a proposal that seems worthwhile yet. More likely, I suspect that a more complex hardware graph facility, combined with specialized ad hoc I/O optimizations, will succeed here.

4) The inner loop of the current Linux scheduler is not invoked by a ready-to-run task looking for a suitable cpu on which to run, but rather by a freed-up cpu looking for the best task to run on itself.
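Roughly, the shape of that loop is the following. This is my own stand-alone paraphrase, not the actual kernel source; "struct task" here is just a stand-in for task_struct, with only the fields that matter for this discussion:

    #include <stdint.h>

    /* Stand-in for task_struct; only the fields relevant here are shown. */
    struct task {
            struct task *next;              /* runqueue link                     */
            int          weight;            /* stand-in for the goodness() value */
            uint64_t     cpus_allowed;      /* one bit per physical cpu          */
    };

    /*
     * A freed-up cpu walks the runqueue looking for the best task that it
     * is allowed to run.  The per-task eligibility test is a single
     * mask-and-test against this cpu's bit in cpus_allowed.
     */
    struct task *pick_task(struct task *runqueue, int this_cpu)
    {
            struct task *p, *best = 0;

            for (p = runqueue; p != 0; p = p->next) {
                    if (!(p->cpus_allowed & (1ULL << this_cpu)))
                            continue;       /* task not allowed on this cpu */
                    if (best == 0 || p->weight > best->weight)
                            best = p;
            }
            return best;
    }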
This means that the kernel data structure suggested in my original Design, with a cur_cpus_allowed bit vector indexed by logical cpu number and an array used to map the logical cpu number (in that particular task's view) to the physical cpu number (in the system view), was not efficient. A key part of the scheduler inner loop, which currently is a couple of bit masks and tests, would have to become a code loop over the logical-to-physical array. This would be an unacceptable hit on scheduler performance.

Two alternatives have been proposed to address this.

a) The logical-to-physical map could be stored inverted in the kernel. Instead of storing, for each cpumemset, the list of cpus it contains, store, for each cpu, the cpumemsets[] index of each cpumemset containing that cpu. The impact on the scheduler inner loop code would be the addition of another bit mask, test and conditional jump.

b) Move the logical-to-physical conversion out of the inner scheduler loop. Instead of changing the cpus_allowed bit vector to be indexed by the task-relative logical cpu id, leave cpus_allowed _exactly_ as it is now, indexed by physical cpu id. Whenever calls are made to manipulate (create, destroy, alter or reattach) cpumemsets, run the necessary conversion loop from logical to physical at that time, updating the cpus_allowed bit vector. This gets the conversion loop out of the critical code path. (A rough sketch of this conversion is appended after my signature.) At some point, when someone is ready to push this code past 64 cpus, the current implementation of cpus_allowed as a single-word bit vector will have to change, but that had to happen anyway.

I have chosen (b), because it requires _no_ scheduler change, and because it is conceptually slightly simpler to keep the kernel cpumemsets[] data structure in the same form as it is manipulated by the system calls, rather than inverted.

The feature of (b) above, that it makes _no_ change to the current inner scheduler loop (the version of the scheduler with Ingo's cpus_allowed bit vector), is important. I hope that we are able to merge this CpuMemSets design with the MultiQueue scheduler work coming from our friends at IBM, and the easiest patch to merge is the one that changes nothing.

I won't rest till it's the best ...

                        Manager, Linux Scalability
                        Paul Jackson  1.650.933.1373
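For anyone who wants to see what the conversion in (b) amounts to, here is a minimal sketch, run once at cpumemset manipulation time rather than in the scheduler. The structure and function names below are invented for illustration only - they are not the actual kernel interface of this proposal - and the single 64-bit word of course still carries the 64 cpu limit noted above.

    #include <stdint.h>

    /* Illustrative-only view of a cpumemset: its logical-to-physical cpu map. */
    struct cpumemset {
            int  ncpus;     /* number of cpus in this set              */
            int *cpus;      /* cpus[logical cpu id] == physical cpu id */
    };

    /*
     * Approach (b): when a cpumemset is created, destroyed, altered or
     * reattached, run the logical-to-physical loop here, once, and rebuild
     * the physical cpus_allowed bit vector.  The scheduler inner loop keeps
     * using cpus_allowed exactly as it does today.
     */
    uint64_t cpumemset_to_cpus_allowed(const struct cpumemset *cms)
    {
            uint64_t allowed = 0;
            int i;

            for (i = 0; i < cms->ncpus; i++)
                    allowed |= 1ULL << cms->cpus[i];  /* set this physical cpu's bit */
            return allowed;
    }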