Scalable Statistics Counters
Dipankar Sarma <dipankar@in.ibm.com>
Ravikiran G. Thirumalai <kiran@in.ibm.com>
Paul E. McKenney <paul.mckenney@us.ibm.com>
Overview:
In-kernel counters are used to collect statistics about various events and
to report them on request (e.g., total packets transmitted by a network
interface, context switches, disk usage statistics, etc.). These counters
are usually not serialized by any mutex, since they only need to be
reasonably accurate. Depending on their purpose and the load conditions,
such counters may be updated very frequently. This document proposes a new
set of interfaces that should improve the performance of such counters in
multiprocessor environments.
The basic idea behind the approach is to keep a per-CPU version of each
counter and have each CPU update only its own version. The per-CPU versions
are consolidated when the counter is read. This improves performance
significantly in multiprocessor environments by reducing the bouncing of
the counter cache lines among different CPUs. At the same time, when
compiled without the CONFIG_SMP option set, the implementation switches to
code that is completely devoid of SMP overhead.
Interfaces:
Counters for which exactness is not important (like statistics counters)
can be declared as type statctr_t. This is an opaque type
and users of the counters do not need to bother about its fields.
For an SMP kernel (CONFIG_SMP=y), it is a structure holding
information that can be used to quickly locate the appropriate
per-CPU counter. For a uniprocessor kernel, it is just an
unsigned long, so there is no overhead.
#ifdef CONFIG_SMP
typedef struct statctr {
	pcpu_ctr_t *ctr;
} statctr_t;
#else
typedef unsigned long statctr_t;
#endif
-
int statctr_init(statctr_t *stctr, unsigned long value)
-
int statctr_ninit(statctr_t *stctr, int count, unsigned long value)
All statctrs must be initialized using one of these interfaces:
statctr_init() initializes a single counter to the given value, and
statctr_ninit() initializes an array of 'count' counters. These counters
cannot be used by simply allocating statctr_t objects statically at
compile time, because each counter refers to per-CPU versions in
different cache lines, which are allocated during statctr_init().
Counters initialized using these functions must be freed using
statctr_cleanup() or statctr_ncleanup().
For uniprocessor kernels, statctr_init()/statctr_ninit() reduce to
simply initializing the unsigned long counter or an array of them;
no separate allocation is involved. These functions return 0 on success
and a non-zero value on failure.
-
void statctr_cleanup(statctr_t *stctr)
-
void statctr_ncleanup(statctr_t *stctr, int count)
These interfaces must be used to free statistics counters
initialized using statctr_init() or statctr_ninit()
respectively. They free the per-cpu counter(s) and internal
data structure(s) allocated during statctr_init() or
statctr_ninit(). They are no-ops for uniprocessor kernels.
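As an illustration, here is a minimal usage sketch of counter
initialization and cleanup; the structure and function names
(my_dev_stats, my_dev_stats_init, my_dev_stats_free) are hypothetical
and not part of the proposed interface:

/* Hypothetical driver keeping Tx/Rx packet statistics in statctrs. */
struct my_dev_stats {
	statctr_t tx_packets;
	statctr_t rx_packets;
};

static int my_dev_stats_init(struct my_dev_stats *stats)
{
	if (statctr_init(&stats->tx_packets, 0))
		return -ENOMEM;
	if (statctr_init(&stats->rx_packets, 0)) {
		/* Second init failed; release the first counter. */
		statctr_cleanup(&stats->tx_packets);
		return -ENOMEM;
	}
	return 0;
}

static void my_dev_stats_free(struct my_dev_stats *stats)
{
	statctr_cleanup(&stats->tx_packets);
	statctr_cleanup(&stats->rx_packets);
}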
-
void statctr_inc(statctr_t *stctr)
This interface increments the counter by one. Internally, only the
per-cpu counter is incremented. For uniprocessor kernels, it simply
non-atomically increments the unsigned long statistics counter.
-
void statctr_dec(statctr_t *stctr)
Decrements the counter by one. Internally, only the per-cpu
counter is decremented. For uniprocessor kernels, it non-atomically
decrements the counter.
-
void statctr_add(statctr_t *stctr, unsigned long addend)
Adds the addend value to the counter. Internally, only the per-cpu
counter is updated for SMP kernels.
-
void statctr_sub(statctr_t *stctr, unsigned long subtrahend)
Subtracts the subtrahend value from the counter. Internally,
only the per-cpu counter is updated for SMP kernels.
-
unsigned long statctr_read(statctr_t *stctr)
Returns the counter value. This function reads all of the
per-cpu versions of this counter, adds them up and returns the
sum to the caller. For uniprocessor kernels, it just returns the
value of the counter.
-
void statctr_set(statctr_t *stctr, unsigned long i)
Sets the statctr to the value i. If statctr_read()
is invoked on a counter after statctr_set(), the return
value should reflect the value set.
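Continuing the hypothetical example above, updates happen on the fast
path while reads consolidate the per-CPU versions; again, the function
names are illustrative only:

static void my_dev_tx_done(struct my_dev_stats *stats, unsigned long npkts)
{
	/* Fast path: touches only this CPU's version of the counter. */
	statctr_add(&stats->tx_packets, npkts);
}

static void my_dev_get_stats(struct my_dev_stats *stats,
			     unsigned long *tx, unsigned long *rx)
{
	/* Slow path: sums all the per-CPU versions on SMP kernels. */
	*tx = statctr_read(&stats->tx_packets);
	*rx = statctr_read(&stats->rx_packets);
}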
Statistics Counter Maintenance:
In order to avoid sharing of cache lines between CPUs, the per-CPU
versions of a statctr_t counter are located in different cache lines.
However, since a counter is much smaller than a typical cache line,
the per-CPU versions of different statistics counters are interlaced
within the same cache lines to save space.
CPU #0 CPU #1
----------------- ----------------- ^
| | | | |
| counter #0 | | counter #0 | |
| | | | |
----------------- ----------------- |
| | | | |
| counter #1 | | counter #1 | |
| | | | |
----------------- ----------------- |
| | | | |
| counter #2 | | counter #2 | |
| | | | |
----------------- ----------------- SMP_CACHE_BYTES
| | | |
| counter #4 | | counter #4 | |
| | | | |
----------------- ----------------- |
| | | | |
| | | | |
| | | | |
----------------- ----------------- |
. . |
. . |
. . |
. . |
. . |
. . v
We use a separate allocator to allocate and maintain interlaced
per-cpu counters. Currently there are two implementations of
this allocator - Fixed and Flexible. They differ in the flexibility
provided, especially with respect to handling addition or removal
of CPUs in the system. The statctr implementation uses a common
set of APIs to access the per-cpu counters irrespective of the
underlying allocator. A per-cpu counter is represented by
an opaque type pcpu_ctr_t. The APIs are as follows (a sketch of how
the statctr layer can sit on top of them appears after the list) -
- pcpu_ctr_t *pcpu_ctr_alloc(int flag)
This allocates a per-CPU counter including all the data structures
necessary in the underlying implementation. Returns NULL on failure.
- void pcpu_ctr_free(pcpu_ctr_t *ctr)
Frees the counter.
- unsigned long *PCPU_CTR(pcpu_ctr_t *ctr, int cpu)
Given a cpu, returns a pointer to its own version of the counter.
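The following is a minimal sketch of how the statctr layer could be
built on top of these APIs for SMP kernels; the actual implementation
in the patches may differ in details such as error codes and
initialization policy:

int statctr_init(statctr_t *stctr, unsigned long value)
{
	int cpu;

	stctr->ctr = pcpu_ctr_alloc(GFP_KERNEL);
	if (stctr->ctr == NULL)
		return -ENOMEM;
	/* Zero every per-CPU version, then seed CPU #0's version with
	   the requested initial value. */
	for (cpu = 0; cpu < smp_num_cpus; cpu++)
		*PCPU_CTR(stctr->ctr, cpu) = 0;
	*PCPU_CTR(stctr->ctr, 0) = value;
	return 0;
}

unsigned long statctr_read(statctr_t *stctr)
{
	unsigned long sum = 0;
	int cpu;

	/* Sum the per-CPU versions; no locking is used, so the result
	   is only approximately accurate while updates are in flight. */
	for (cpu = 0; cpu < smp_num_cpus; cpu++)
		sum += *PCPU_CTR(stctr->ctr, cpu);
	return sum;
}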
Fixed counters:
my_struct
------------
| |
| statctr0 |------|
| | v
----------- ----------------- ^
| | | | |
| statctr1 |---- | counter #0 | |
| | | |
--> ----------------- |
| | SMP_CACHE_BYTES
| counter #1 | (for CPU #0)
| | |
----------------- |
. |
. |
| | v
----------------- ^
| | |
| counter #0 | |
| | |
----------------- |
| | SMP_CACHE_BYTES
| counter #1 | (for CPU #1)
| | |
----------------- |
.
.
.
.
.
Here, a statistics counter declared and initialized using
statctr_init() is allocated from a block of size
smp_num_cpus * SMP_CACHE_BYTES, with each per-CPU
version residing in a different cache-line-aligned chunk of
SMP_CACHE_BYTES. The next counter initialized shares
the same block, with its per-CPU versions interlaced alongside
those of the first. When one set of blocks gets exhausted, another
set is allocated.
The statctr_t opaque type contains a pointer to the CPU #0 version
of the statistics counter, which resides on line 0 of the block
(of size smp_num_cpus * SMP_CACHE_BYTES).
The address of any other CPU's version of the counter is computed
from this pointer using pointer arithmetic.
typedef void *pcpu_ctr_t;

#define PCPU_CTR(ctr, cpu) \
	((unsigned long *)((char *)(ctr) + SMP_CACHE_BYTES * (cpu)))

static __inline__ void statctr_inc(statctr_t *stctr)
{
	/* Bump only this CPU's private version of the counter. */
	(*PCPU_CTR(stctr->ctr, smp_processor_id()))++;
}
Because SMP_CACHE_BYTES is a power of 2, a reasonable compiler will
generate the offset computation using shifts rather than costlier
multiply instructions.
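Below is a condensed sketch of how the fixed-scheme allocator might
carve interlaced counters out of such blocks; the names ctr_blk and
ctr_blk_next are illustrative, and serialization of concurrent
allocations, freeing of counters and cache line alignment of the block
are omitted for brevity:

#define CTRS_PER_LINE	(SMP_CACHE_BYTES / sizeof(unsigned long))

static char *ctr_blk;			/* current block; CPU #0's line is line 0 */
static unsigned int ctr_blk_next;	/* next free slot in CPU #0's line */

pcpu_ctr_t *pcpu_ctr_alloc(int flag)
{
	if (ctr_blk == NULL || ctr_blk_next == CTRS_PER_LINE) {
		/* Current set of lines is exhausted; allocate another. */
		ctr_blk = kmalloc(smp_num_cpus * SMP_CACHE_BYTES, flag);
		if (ctr_blk == NULL)
			return NULL;
		ctr_blk_next = 0;
	}
	/* Hand out the next slot on CPU #0's line; CPU n's version of
	   the same counter lies SMP_CACHE_BYTES * n bytes further on,
	   exactly as PCPU_CTR() computes it. */
	return (pcpu_ctr_t *)(ctr_blk + ctr_blk_next++ * sizeof(unsigned long));
}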
Flexible counters:
In order to make statctr work with CPU-hotplug features, we need
a different approach to maintaining per-cpu counters. For
CPU online/offline to work with the Fixed counters, we would have to
allocate cacheline-sized counter chunks (of size SMP_CACHE_BYTES)
for all NR_CPUS possible CPUs, which is likely to be wasteful on many
SMP machines. Instead, we use a flexible array-based scheme that allows
dynamic expansion of the per-cpu chunks allocated for counters.
-
Each statctr is represented by an array of size NR_CPUS
that contains the pointers to the per-cpu versions of the counter.
Initially we fill in the array entries for the currently online CPUs.
Onlining/offlining a CPU only requires updating the affected CPU's
entry in this array. The data structures that contain the statctr
need no modification and remain available to the other CPUs for
increment/decrement/read operations.
--------------- ---------------
| | | |
| next |--------->| next |--------->
| | | |
--------------- ---------------
| | | |
| | | |
| | | |
--------------- --------------
| |
my_struct | |
----------- |
| | v
| statctr |----->---------------
| | ^ | |
---------- | | CPU #0 |--------->----------------- ^
| | | | | | | |
| | | ---------------- | | |
| | | | | |
| | CPU #1 |---- | | |
NR_CPUS | | | | | SMP_CACHE_BYTES
| ---------------- | | | |
| | | | | | |
| | CPU #2 | | | | |
| | | | | | v
| ---------------- ------>------------------
| | | | |
| . | |
| . | |
| . | |
| . | |
v
- The allocator maintains the counters in counter blocks.
Each block consists of an array of size NR_CPUS in which
each element corresponding to an online CPU points to
a cacheline-sized, cacheline-aligned counter chunk (of SMP_CACHE_BYTES).
These chunks are allocated using the existing kmem_cache_alloc()
slab allocator. The counter block
also has a free list for allocating individual counters within itself.
- The fast-path code (increment/decrement) involves an extra
load to get the given CPU's version of the counter, but this cost
is small compared to the savings, since the extra load does not
result in cache line bouncing.
typedef struct pcpu_ctr_s {
	void *arr[NR_CPUS];	/* Per-cpu counter array */
	void *blkp;		/* Pointer to the block from which this
				   counter was allocated (used by the
				   free code) */
} pcpu_ctr_t;

#define PCPU_CTR(ctr, cpuid)	((unsigned long *)(ctr)->arr[cpuid])

static __inline__ void statctr_inc(statctr_t *statctr)
{
	/* Bump only this CPU's private version of the counter. */
	(*PCPU_CTR(statctr->ctr, smp_processor_id()))++;
}
- We may maintain a list of all such counters allocated
(a linked list of the counter pointer arrays). During the onlining or
offlining of CPUs, this list is traversed and the corresponding
pointer in each array is updated (a sketch of the online case appears
after this list). No locking is necessary for this, since different
CPUs change only their own pointers. Locking is, of course, needed
when creating new counters.
-
If the number of CPUs in the system decreases, we can live
with leaving the previously allocated cache lines around
unused. If the number of CPUs increases, we need to allocate
an additional cache line for each additional CPU.
-
This scheme also handles sparse CPU numbering.
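A rough sketch of the online case follows; the global list, the 'list'
member assumed to have been added to pcpu_ctr_t, and the per-counter
chunk allocation (which ignores the sharing of chunks among the
counters of a block) are simplifying assumptions made purely for
illustration:

static LIST_HEAD(pcpu_ctr_list);	/* all counter pointer arrays */

static int statctr_cpu_online(int cpu)
{
	struct list_head *p;
	pcpu_ctr_t *ctr;
	unsigned long *line;

	list_for_each(p, &pcpu_ctr_list) {
		ctr = list_entry(p, pcpu_ctr_t, list);
		if (ctr->arr[cpu] != NULL)
			continue;	/* this CPU has been online before */
		line = kmalloc(SMP_CACHE_BYTES, GFP_KERNEL);
		if (line == NULL)
			return -ENOMEM;
		memset(line, 0, SMP_CACHE_BYTES);
		/* Only the incoming CPU's entry changes; the other CPUs
		   keep using their own entries without any locking. */
		ctr->arr[cpu] = line;
	}
	return 0;
}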
Results:
All measurements were done on an 8-way 700MHz Pentium III Xeon
(1MB L2 cache) machine with 4GB of RAM.
Microbenchmarking:
Microbenchmarking, while not indicative of any real workload, is
still useful for checking whether a given implementation has
inherent bottlenecks. Here are the results of measuring the cost
of global counters vs. statctrs, averaged over 200 million operations.
Observations:
- The 6 cycles per global counter increment include the overhead
of the loop termination check and a delay of a couple of cycles to
allow cache line requests from other CPUs to come in before the next
increment. This makes the scenario more realistic.
- The single-processor cost (11 cycles) is greater for statctr,
but the measurements were done using an SMP kernel. In a UP kernel,
a statctr reduces to a plain unsigned long.
- Although a statctr operation costs more, the cache line delays
suffered by global counters are large enough to outweigh that cost,
so statctrs show a benefit from 2 CPUs upwards for frequently used
counters.
- Other than negligible associativity misses, the statctr
implementation does not incur any harmful cache overhead.
Tbench over loopback:
The loopback driver uses 4 counters to maintain Tx/Rx statistics.
This measurement used tbench over lo to compare the relative
advantages and disadvantages of using statctrs for these statistics.
The number of cycles needed to increment the 4 counters, as well as
the profile count for loopback_xmit(), were measured.
Observations:
- The cost of statctr is higher for 1 processor, but that is with
an SMP kernel; for UP kernels, a statctr is a plain long.
- The effect of cache line bouncing in loopback_xmit() under
tbench load is sufficient for statctrs to show a benefit from 2
CPUs upwards.
- The flattening of the cost curve from 2 CPUs upwards is probably
due to the fact that loopback_xmit() is serialized: the cache-line
related delay is always limited to fetching a modified cache line,
and there are never multiple CPUs trying to modify the cache line
simultaneously.
Download:
The statctr patches are available from the Scalable Counters package
on the LSE (Linux Scalability Effort) project site.