The first phase is fairly straightforward; it is the implementation of the second phase where the current algorithms differ. One simple way (described in the OLS2001 paper) is to keep rescheduling the task needing to update until it has run on all the CPUs. This ensures that all the CPUs have lost their references to the protected data, and the update can now be done. The obvious drawback is that the update path is fraught with overheads - additional context switches, waiting for the "grace period" to complete, and so on. An alternative approach is to batch the updates and complete them asynchronously when it is safe to do so. This is the approach that we chose for our experiments. The implementations that we did essentially follow this logic, but vary in how they determine that it is safe to complete the update (end of grace period). The policies for determining the grace period are -
1. Passively check, at periodic intervals, whether every CPU has passed through a quiescent state (e.g. a context switch) since the batch was queued.
2. Actively force every CPU through a quiescent state (e.g. by forcing a reschedule) so that the grace period completes quickly.
Description of various implementations:
1. X-rcu
This is an experimental implementation that uses features likely to be available in future kernels. Using these features leads to simple code. Some of the features used are -
The basic algorithm is the same as the original ptx algorithm -
It uses per-CPU queues to queue callbacks that are invoked when it is safe to do so. A timer is registered for each CPU, and each checks for quiescent states at an infrequent interval. For a batch of callbacks, when all CPUs have checked in, the grace period is over and it is safe to invoke the callbacks in the batch. The modified smptimers patch guarantees that a timer queued on a CPU fires on that same CPU (Ingo's smptimers doesn't). This allows us to emulate the ptx algorithm without modifying arch-dependent code (CPU-local timer).
2. rcu
It is also based on the old ptx algorithm. The salient features of this implementation are -
3. rcu_poll
This patch uses the second policy to determine the grace period. For simplicity, it uses global batches of RCU callbacks protected by a global lock; the assumption here is that RCU will be a very infrequent event. For any batch of callbacks, it forces a reschedule on all CPUs, thereby immediately completing a grace period. Meanwhile, a tasklet is repeatedly scheduled to keep checking for the completion of the grace period. The basic algorithm is this -
For RCU, queue the callback and start the RCU tasklet.
In the RCU tasklet:
    If no batch is waiting for its grace period to be over, start the grace period for the new batch and force a reschedule on all the CPUs. Continue to reschedule the tasklet until all the CPUs have checked in for the grace period.
    Else, check whether all the CPUs have checked in for the current grace period and process accordingly. [If all CPUs have checked in, the grace period is over and the current batch of callbacks is invoked. Checking for the grace period is done by monitoring a per-CPU context switch counter.]
If there is no RCU pending, then there is no overhead other than the per-CPU context switch counter increment.
4. rcu_ltimer
This is very similar to the original ptx implementation. It uses the per-CPU local timer handler to check whether the corresponding CPU has quiesced. The rest is handled pretty much like the rcu implementation. The checking is always done, whether there is RCU pending or not. Quiescent-CPU checking is done using a per-CPU context switch counter.
5. rcu_taskq
This uses the second policy of determining the grace period - it forces the CPUs to quiesce and completes the grace period quickly. This is achieved using a set of per-CPU daemons (krcud) that sleep until woken up to put the CPUs through a quiescent state. The callback queues are global (as opposed to per-CPU). A bottom-half handler (taskqueue) is used to wake up the krcuds. Once they have all run, the grace period is over and the callbacks can be invoked.
A major problem with this implementation is that keventd (used by the taskqueue implementation) can get starved, since it runs at a very low priority. That may cause RCU to prolong the grace period, which may in turn increase the memory pressure on the system.
6. rcu_sched [Rusty]
Implemented by Rusty Russell, this algorithm uses a ring of per-CPU counters which are handled during context switch if there is any RCU pending in the system. Each CPU's counter is set to one greater than its neighbor's. When the neighbor's counter changes, it is guaranteed that all the CPUs have passed through a context switch. rcu_sched optimizes the queueing of callbacks by maintaining separate queues for different interrupt levels. It also checks for pending RCU callbacks in idle-task transitions to detect the grace period more efficiently.
7. rcu_ipi [doesn't work due to linux smp_call_function implementation]
This is an unsuccessful attempt at using the call_function IPI to handle per-CPU quiescent state checking and processing of per-CPU RCU callback queues. The idea was simple: a timer fires at regular intervals (say, 10 milliseconds) and uses smp_call_function() to prod the other CPUs for RCU-related processing. The work done in the IPI handler is similar to the other periodic-checking implementations - participate in the quiescent state detection mechanism and process the RCU callback queues if necessary. Unfortunately, the call_function IPI mechanism doesn't work when sent from BH context (such as a timer handler), and implementing a separate IPI for this would require modification of arch-dependent code, which isn't worth the complexity. Status: abandoned.