Read-Copy Update Mutual Exclusion

Overview:

Multiprocessor systems require ensuring serialized execution of critical sections of code that manipulate shared data structures. Various mechanisms for mutual exclusion have been used in traditional Unix kernels including spin locks, semaphores, reader-writer spin locks etc. Even uniprocessor systems require controlled concurrency when critical section code can be executed from both process as well interrupt context. While short-term mutual exclusions like spin locks are simple to use, with the advent of faster cpus and memory interconnect speeds not keeping up with it, cost of acquiring spinning locks is increasing with each generation of architecture. The wider this gap is, more the number of cycles a cpu has to waste spin-waiting for some slow memory interconnect to respond. Therefore, it has become increasingly necessary to look for alternatives to conventional spin-waiting locking models.

Read-Copy Update is one such mutual exclusion method where readers (threads trying to access, but not modify the data) can access the shared data without acquiring any conventional lock. The writers (threads that update the data) however, have to use a special callback scheme to update the data. They update all the global references to the updated data with a new copy and use the callback scheme to free the old copy after all the CPUs have lost local references to it by going through a quiescent state (like a context switch). Because the write-side is significantly expensive compared to the read side, Read-Copy Update mechanism is more suited for scenerios where the data to be protected is read more often than written. For uniprocessor systems, Read-Copy Update mechanism eliminates the need to mask interrupts for mutual exclusion. Read-Copy Update mechanism is thus suitable for mutual exclusion in network routing tables, device state tables, deferred deletion of data structures, multi-path I/O device maintenance etc.

Read-Copy Update was originally designed for DYNIX/ptx, a UNIX operating system from Sequent Computer Systems Inc., now a part of IBM. Similar methods were also used for Tornado and K42 OS projects at University of Toronto and IBM Research.

Documentation:

RCU-HOWTO in text and html.
Read-Copy Update Detailed Documentation
Original paper on Read-Copy Update mechanism
Read-Copy Update Paper at Ottawa Linux Symposium, July 2001

Paul McKenney's RCU Web Page

Read-Copy Update Interfaces for Linux kernel:

Recent Read-Copy Update patches for 2.4 and 2.5 kernel provide the following interfaces -

/*
 * Structure to queue requests for invoking a callback at after
 * all CPUs have gone through a "quiescent" state
 */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(void *obj);
	void *obj;
};

/* Invoke a callback after all CPUs go through a quiescent state */
void call_rcu(struct rcu_head *head, void (*func)(void *arg), void *arg);

/* Wait until all the CPUs go through a quiescent state */
void synchronize_kernel(void);

RCU Implementations:

Currently RCU has been implemented in several ways as mentioned in the OLS paper and some more. All implementation provide similar interface - a callback head structure rcu_head that is to be called after all the CPUs have lost any pointers to kernel data that they may have held when they went through atleast one "quiescent" state like context switch, idle loop, user mode code etc. The implementations differ in how they detect that all the CPUs have gone through quiescent state(s).

Quiescent state counter based RCU:

A quiescent state is when a CPU loses its references to kernel data structures and thus cannot interfere with manipulation of in-kernel data or be interfered with by manipulation of in-kernel data. If a given process removes all globall-accessible references to a particular data structure and waits till all the CPUs pass through a quiescent state, it can then manipulate that data structure without any concern of interference from other CPUs.

In this approach of RCU, completion of a quiescent cycle where all CPUs go through a quiescent state is detected by counting and monitoring the quiescent states for all CPUs. Monitoring of the quiescent state counters can be done in various different ways. We have experimented with the following algorithms -

Maintain a per-cpu counter that is incremented whenever there is a context switch or local timer interrupt detects an idle loop or user code. The callbacks are maintained in per-cpu lists and a global batch numbering system is used to assign batch numbers to different per-cpu callback batches. A global current batch number indicates current batch of callbacks waiting for a quiescent cycle to complete. Once one batch completes, callbacks are moved to a "done" list and the next batch starts. The done callbacks are invoked from a bottom-half context. This approach has the advantage that there is very little overhead in the fast path (schedule()), just a per-cpu counter increment. Earlier implementations of this used per-cpu local timer to check for callbacks which is in arch-dependent trees in linux. The latest implementation however avoids that and has no code in arch-dependent source trees.
Use a per-cpu counter to count the context switches, but instead of instead of waiting for the counter to be incremented in each CPU, it forces a reschedule in all CPUs. This variant called rcu_poll was contributed by Andrea Arcangeli and is a part of aa series of kernels.

These implementations can be downloaded from Read-Copy Update Package in LSE.

Scheduler based RCU:

In this approach by Rusty Russell, the completion of RCU cycle is detected using a set of per-CPU counters that are updated and checked during context switches. The counter scheme allows detection of when all CPUs have done atleast one context switch and the callbacks are then invoked from scheduler context. One main advantage of this scheme is that it allows sleeping in the callback.

Rusty is maintaining this patch here.

Read-Copy Update Uses in Linux Kernel:

Scaling the dentry cache:

Using RCU it is possible to provide some relief from acquiring dcache_lock. The main work of dentry cache is done in looking up existing dentries in the cache which is done by d_lookup routine. Using RCU (and lazy-lru algorithm), we could do lock-less lookup for dentries and bring down the contention for dcache_lock while running dbench considerably from 16.5% to 0.95% on an 8-way SMP box. Recent measurements by Anton showed > 20% improvement in dbench throughput for a 32-way PPC machine. It also shows considerably less contention for other workloads like httperf. Details of this work can be found here.
These implementations can be downloaded from Read-Copy Update Package in LSE

Hot Plug CPU Support:

Linux Hotplug CPU Support uses Read-Copy Update to support addition and removal of CPUs safely.

Module Unloading and Cleanup:

The deferred update mechanism provided by Read-Copy Update can be useful in providing a two phase cleanup logic to solve the race conditions involved in module unloading. The following examples illustrate how RCU can be exploited here -

module_unloading-2.4.2-0.1.patch - Module unloading using Read-Copy Update for 2.4.2 kernel (Version 01) provides a new cleanup logic.
Readme explaining module unloading using Read-Copy Update.
Test cases for module unloading using Read-Copy Update.

A new framework for module unloading using RCU is being developed by Rusty Russell.

Scalable FD Management:

The per-task files_struct can be shared between tasks created using clone() with CLONE_FILES flag set. Due to bouncing of the cache line containing the rwlock in files_struct, such workloads can get slowed significantly, specially in higher-end SMP and NUMA machines. RCU provides a way to do lock-less lookup of file descriptors thereby avoiding this problem completely. Our experiments show that on a benchmark like "chat", the performance improvement can be as high as 30% even on a 4-way system.

The details of the problem and the proposed solution with improved results can be found here. The patch is files_struct_rcu-2.4.13pre5-06.patch.

IPV4 Route Cache Lookup:

The route cache is a hash table that caches routes and is mostly read. Update to it is done only periodically when entries are aged off or the cache is flushed. For the most part however it is just looked up. The route cache entries are protected by a per-bucket reader-writer lock. Our work here uses RCU to do lock-less lookup of routes from route cache.

The patches for lock-less route cache lookup can be downloaded from the Read-Copy Update Package in LSE download site.

Read-Copy Update Previous Versions:

The earlier patches for Read-Copy Update can be downloaded from the Read-Copy Update Package in LSE download site.

RCU based on original DYNIX/ptx code (released by IBM under GPL):

rclock-2.4.1-01.patch
rclock-2.4.2-01.patch