The basic idea is to keep a per-CPU version of each counter and have every CPU update only its own version. The per-CPU versions are consolidated when the counter is read. This improves performance significantly in multiprocessor environments by reducing the bouncing of counter cache lines among CPUs. When the kernel is compiled without the CONFIG_SMP option, the implementation reduces to code with no SMP overhead at all.
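The update and read paths described above can be sketched in plain user-space C. This is only an illustration under stated assumptions: the names sketch_ctr, sketch_ctr_inc and sketch_ctr_read, the CPU count and the cache line size are all hypothetical, the CPU number is passed in explicitly instead of coming from smp_processor_id(), and each counter is padded to a full cache line per CPU for simplicity, whereas the layout described later interlaces several counters within a line.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NCPUS      4   /* assumed CPU count for this sketch */
#define SKETCH_CACHE_LINE 64  /* assumed cache line size in bytes */

/* One slot per CPU, padded so that each slot owns a whole cache line. */
struct sketch_slot {
	unsigned long count;
	char pad[SKETCH_CACHE_LINE - sizeof(unsigned long)];
};

struct sketch_ctr {
	struct sketch_slot slot[SKETCH_NCPUS];
};

/* Update path: touch only the slot of the CPU doing the update. */
static void sketch_ctr_inc(struct sketch_ctr *c, int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NCPUS);
	c->slot[cpu].count++;
}

/* Read path: consolidate all per-CPU slots into a single value. */
static unsigned long sketch_ctr_read(const struct sketch_ctr *c)
{
	unsigned long sum = 0;
	int cpu;

	for (cpu = 0; cpu < SKETCH_NCPUS; cpu++)
		sum += c->slot[cpu].count;
	return sum;
}
```

Because the read walks every slot without locking, the value it returns is only approximate while updates are in flight, which is acceptable for counters where exactness is not important.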
Counters where exactness is not important (like statistics counters)
can be declared with type statctr_t. This is an opaque type,
and users of the counters do not need to care about its fields.
For an SMP kernel (CONFIG_SMP=y), it is a structure holding
information that can be used to quickly compute the address of the
appropriate per-CPU counter. For a uniprocessor kernel, it is just an
unsigned long, so there is no overhead.
	#ifdef CONFIG_SMP
	typedef struct statctr {
		pcpu_ctr_t *ctr;
	} statctr_t;
	#else
	typedef unsigned long statctr_t;
	#endif
             CPU #0                  CPU #1
	-----------------	-----------------       ^
	|		|	|		|	|
	|   counter #0  |       |  counter #0   |	|
	|		|	|		|	|
	-----------------	-----------------	|
	|		|	|		|	|
	|   counter #1  |       |  counter #1   |	|
	|		|	|		|	|
	-----------------	-----------------	|
	|		|	|		|	|
	|   counter #2  |       |  counter #2   |	|
	|		|	|		|
	-----------------	----------------- SMP_CACHE_BYTES
	|		|	|		|	
	|   counter #4  |       |  counter #4   |	|
	|		|	|		|	|
	-----------------	-----------------	|
	|		|	|		|	|
	|		|	|		|	|
	|		|	|		|	|
	-----------------	-----------------	|
               .                        .		|
               .                        .		|
               .                        .		|
               .                        .		|
               .                        .		|
               .                        .		v
We use a separate allocator to allocate and maintain interlaced per-CPU counters. Currently there are two implementations of this allocator: Fixed and Flexible. They differ in the flexibility they provide, especially in handling the addition or removal of CPUs in the system. The statctr implementation uses a common set of APIs to access the per-CPU counters irrespective of the underlying allocator. A per-CPU counter is represented by an opaque type, pcpu_ctr_t. The APIs are -
  my_struct
  ------------
  |          |
  | statctr0 |------|
  |          |      v
  -----------        -----------------    ^
  |          |       |	             |    |
  | statctr1 |----   |   counter #0  |    |
                 |   |	             |    |
                 --> -----------------    |
		     |		     | SMP_CACHE_BYTES
		     |   counter #1  | (for CPU #0)
		     |		     |    |
		     -----------------    |
		             .            |
		             .            |
		     |		     |    v
		     -----------------    ^
		     |		     |    |
		     |   counter #0  |    |
		     |		     |    |
		     -----------------    |
		     |		     | SMP_CACHE_BYTES
		     |   counter #1  | (for CPU #1)
		     |		     |    |
		     -----------------    |
		            .        
		            .       
		            .      
		            .     
		            .    
Here a statistics counter declared and initialized using
statctr_init() is allocated from a block of
smp_num_cpus * SMP_CACHE_BYTES bytes, with each per-CPU
version residing in a different cache-line-aligned chunk of
SMP_CACHE_BYTES size. The next counter initialized shares
the same block, with its per-CPU versions interlaced with the first.
When one set of blocks gets exhausted, another set is allocated.
The statctr_t opaque type contains a pointer to the CPU #0 version of the statistics counter. This pointer points into line 0 of the block (of size smp_num_cpus * SMP_CACHE_BYTES). The address of the per-CPU version of this counter is computed using pointer arithmetic.
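typedef void *pcpu_ctr_t;
/* Per-CPU versions are SMP_CACHE_BYTES apart; the offset is in bytes. */
#define PCPU_CTR(ctr, cpu) \
       ((unsigned long *)((char *)(ctr) + SMP_CACHE_BYTES * (cpu)))
static __inline__ void statctr_inc(statctr_t *statctr)
{
	/* Increment only the current CPU's version of the counter. */
	(*PCPU_CTR(statctr->ctr, smp_processor_id()))++;
}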
Because SMP_CACHE_BYTES is a power of 2, a reasonable compiler can compute the offset with a shift rather than a costlier multiply instruction.
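The interlaced layout and its address computation can be tried out in user space with a short sketch. Everything named fixed_* or FIXED_* below is hypothetical and only mirrors the description: FIXED_NCPUS stands in for smp_num_cpus, FIXED_CACHE_BYTES for SMP_CACHE_BYTES, and calloc() stands in for the real block allocator (it returns zeroed memory but does not guarantee cache line alignment, which a real allocator would need to provide).

```c
#include <assert.h>
#include <stdlib.h>

#define FIXED_NCPUS       4   /* stands in for smp_num_cpus (assumed) */
#define FIXED_CACHE_BYTES 64  /* stands in for SMP_CACHE_BYTES (assumed) */

/* Same idea as PCPU_CTR: step FIXED_CACHE_BYTES bytes per CPU. */
#define FIXED_PCPU_CTR(ctr, cpu) \
	((unsigned long *)((char *)(ctr) + FIXED_CACHE_BYTES * (cpu)))

/* Allocate one zeroed block: one cache-line-sized chunk per CPU. */
static void *fixed_block_alloc(void)
{
	return calloc(FIXED_NCPUS, FIXED_CACHE_BYTES);
}

/* The idx-th counter's CPU #0 version lives in chunk 0 of the block. */
static void *fixed_ctr_at(void *block, size_t idx)
{
	assert(idx < FIXED_CACHE_BYTES / sizeof(unsigned long));
	return (char *)block + idx * sizeof(unsigned long);
}

/* Consolidate the per-CPU versions of one counter. */
static unsigned long fixed_ctr_sum(void *ctr)
{
	unsigned long sum = 0;
	int cpu;

	for (cpu = 0; cpu < FIXED_NCPUS; cpu++)
		sum += *FIXED_PCPU_CTR(ctr, cpu);
	return sum;
}
```

With these assumed sizes, counter #1 on CPU #3 sits 3 * FIXED_CACHE_BYTES + sizeof(unsigned long) bytes into the block, so counters #0 and #1 share a cache line on each CPU while different CPUs never share one.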
                  ---------------          ---------------
                  |             |          |             |
                  |    next     |--------->|     next    |--------->
                  |             |          |             |
		  ---------------          ---------------
                  |             |          |             |
                  |             |          |             |
                  |             |          |             |
                  ---------------          --------------
                         |                        |
   my_struct             |                        |
  -----------            |
  |         |            v
  | statctr |----->---------------
  |         |  ^  |              |
  ----------   |  |   CPU #0     |--------->-----------------  ^
  |         |  |  |              |          |                |  |
  |         |  |  ----------------          |                |  |
               |  |              |          |                |  |
               |  |   CPU #1     |----      |                |  |
          NR_CPUS |              |   |      |                | SMP_CACHE_BYTES
               |  ----------------   |      |                |  |
               |  |              |   |      |                |  |
               |  |   CPU #2     |   |      |                |  |
               |  |              |   |      |                |  v
               |  ----------------   ------>------------------  
               |  |              |          |                |
               |          .                 |                |
               |          .                 |                |
               |          .                 |                |
               |          .                 |                |
               v
typedef struct pcpu_ctr_s {
       void *arr[NR_CPUS];     /* Pcpu counter array */
       void *blkp;             /* Pointer to the block from which ctr was
                                  allocated (for use by the free code) */
} pcpu_ctr_t;
#define PCPU_CTR(ctr, cpuid) ((unsigned long *)(ctr)->arr[cpuid])
static __inline__ void statctr_inc(statctr_t *statctr)
{
	(*PCPU_CTR(statctr->ctr, smp_processor_id()))++;
}
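The pointer-array lookup used by this second scheme can also be sketched in user space. The flex_* names are hypothetical, FLEX_NCPUS stands in for NR_CPUS, and each per-CPU counter is allocated with a separate calloc() purely for illustration; the real allocator hands out counters from per-CPU blocks, which this sketch does not attempt to reproduce.

```c
#include <assert.h>
#include <stdlib.h>

#define FLEX_NCPUS 4  /* stands in for NR_CPUS (assumed) */

struct flex_ctr {
	void *arr[FLEX_NCPUS];  /* address of each CPU's counter */
};

/* One array lookup replaces the base + offset arithmetic. */
#define FLEX_PCPU_CTR(ctr, cpuid) ((unsigned long *)(ctr)->arr[cpuid])

/* Give every CPU its own zeroed counter; 0 on success, -1 on failure. */
static int flex_ctr_init(struct flex_ctr *c)
{
	int cpu;

	for (cpu = 0; cpu < FLEX_NCPUS; cpu++) {
		c->arr[cpu] = calloc(1, sizeof(unsigned long));
		if (!c->arr[cpu]) {
			/* Undo the allocations made so far. */
			while (cpu-- > 0)
				free(c->arr[cpu]);
			return -1;
		}
	}
	return 0;
}

static void flex_ctr_destroy(struct flex_ctr *c)
{
	int cpu;

	for (cpu = 0; cpu < FLEX_NCPUS; cpu++)
		free(c->arr[cpu]);
}

/* Consolidate the per-CPU versions of the counter. */
static unsigned long flex_ctr_sum(const struct flex_ctr *c)
{
	unsigned long sum = 0;
	int cpu;

	for (cpu = 0; cpu < FLEX_NCPUS; cpu++)
		sum += *FLEX_PCPU_CTR(c, cpu);
	return sum;
}
```

The trade-off against the fixed layout is an extra memory load per access (the array entry) in exchange for per-CPU storage that need not be contiguous, which is what makes adding or removing a CPU easier to handle.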
All measurements were done on an 8-way PIII 700MHz Xeon (1MB L2 cache) machine with 4GB RAM.
Observations:
Observations: