The first phase is fairly straightforward; it is the implementation of the second phase where the current algorithms differ. One simple way (described in the OLS2001 paper) is to keep rescheduling the task needing to update until it has run on all the CPUs. This ensures that all the CPUs have lost their references to the protected data, and the update can now be done. The obvious drawback is that the update path is fraught with overheads - additional context switches, waiting for the "grace period" to complete, and so on. An alternative approach is to batch the updates and complete them asynchronously when it is safe to do so. This is the approach that we chose for our experiments. The implementations that we did essentially follow this logic, but vary in how they determine that it is safe to complete the update (end of grace period). The policies for determining the grace period are -
1. Passively check, at periodic intervals, whether every CPU has passed through a quiescent state (e.g. a context switch) since the batch was queued.
2. Actively force every CPU through a quiescent state (e.g. by forcing a reschedule) so that the grace period completes quickly.
Description of various implementations:
1. X-rcu
This is an experimental implementation that uses features likely to be available in future kernels. Using these features leads to simple code. Some of the features used are -
The basic algorithm is the same as the original ptx algorithm -
It uses per-CPU queues to queue callbacks that are invoked when it is safe to do so. A timer is registered for each CPU, and each checks for quiescent states at an infrequent interval. For a batch of callbacks, when all CPUs have checked in, the grace period is over and it is safe to invoke the callbacks in the batch. The modified smptimers patch guarantees that a timer queued on a CPU fires on that same CPU (Ingo's smptimers doesn't). This allows us to emulate the ptx algorithm without modifying arch-dependent code (CPU-local timer).
2. rcu
It is also based on the old ptx algorithm. The salient features of this implementation are -
3. rcu_poll
This patch uses the second policy to determine the grace period. For simplicity, it uses global batches of RCU callbacks protected by a global lock; the assumption here is that RCU will be a very infrequent event. For any batch of callbacks, it forces a reschedule on all CPUs, thereby immediately completing a grace period. Meanwhile, a tasklet is repeatedly scheduled to keep checking for the completion of the grace period. The basic algorithm is this -
For RCU, queue the callback and start the RCU tasklet.
In the RCU tasklet:
    If no batch is waiting for its grace period to be over, start the grace period for the new batch and force a reschedule on all the CPUs. Continue to reschedule the tasklet until all the CPUs have checked in for the grace period.
    Else, check whether all the CPUs have checked in for the current grace period and process accordingly. [If all CPUs have checked in, the grace period is over and the current batch of callbacks is invoked. Checking for the grace period is done by monitoring a per-CPU context switch counter.]
If there is no RCU pending, then there is no overhead other than the per-CPU context switch counter increment.
4. rcu_ltimer
This is very similar to the original ptx implementation. It uses the per-CPU local timer handler to check whether the corresponding CPU has quiesced. The rest is handled pretty much like the rcu implementation. The checking is always done, whether there is RCU pending or not. Quiescent-CPU checking is done using a per-CPU context switch counter.
5. rcu_taskq
This uses the second policy of determining the grace period - it forces the CPUs to quiesce and completes the grace period quickly. This is achieved using a set of per-CPU daemons (krcud) that sleep until woken up to put the CPUs through a quiescent state. The callback queues are global (as opposed to per-CPU). A bottom-half handler (taskqueue) is used to wake up the krcuds. Once they have all run, the grace period is over and the callbacks can be invoked.
A major problem with this implementation is that keventd (used by the taskqueue implementation) can get starved, since it runs at a very low priority. That may cause RCU to prolong the grace period, which may in turn increase the memory pressure on the system.
6. rcu_sched [Rusty]
Implemented by Rusty Russell, this algorithm uses a ring of per-CPU counters which are handled during context switch if there is any RCU pending in the system. Each CPU's counter is set to one greater than its neighbor's. When the neighbor's counter changes, it is guaranteed that all the CPUs have passed through a context switch. rcu_sched optimizes the queueing of callbacks by maintaining separate queues for different interrupt levels. It also checks for pending RCU callbacks in idle-task transitions to detect the grace period more efficiently.
7. rcu_ipi [doesn't work due to linux smp_call_function implementation]
This is an unsuccessful attempt at using the call_function IPI to handle per-CPU quiescent state checking and processing of per-CPU RCU callback queues. The idea was simple: a timer fires at regular intervals (say, 10 milliseconds) and uses smp_call_function() to prod the other CPUs for RCU-related processing. The work done in the IPI handler is similar to the other periodic-checking implementations - participate in the quiescent state detection mechanism and process the RCU callback queues if necessary. Unfortunately, the call_function IPI mechanism doesn't work when sent from BH context (such as a timer handler), and implementing a separate IPI for this would require modification of arch-dependent code, which isn't worth the complexity. Status: abandoned.