Read-Copy Update Mutual Exclusion
Overview:
Multiprocessor systems require ensuring serialized execution of
critical sections of code that manipulate shared data structures. Various
mechanisms for mutual exclusion have been used in traditional Unix
kernels including spin locks, semaphores, reader-writer
spin locks etc. Even uniprocessor systems require controlled concurrency
when critical section code can be executed from both process as well
interrupt context. While short-term mutual exclusions like spin locks
are simple to use, with the advent of faster cpus and memory interconnect
speeds not keeping up with it, cost of acquiring spinning locks is increasing
with each generation of architecture. The wider this gap is, more the
number of cycles a cpu has to waste spin-waiting for some slow memory
interconnect to respond. Therefore, it has become increasingly necessary
to look for alternatives to conventional spin-waiting locking models.
Read-Copy Update is one such mutual exclusion method where readers
(threads trying to access, but not modify the data) can access the
shared data without acquiring any conventional lock. The writers
(threads that update the data) however, have to use a special
callback scheme to update the data. They update all the global
references to the updated data with a new copy and use the callback
scheme to free the old copy after all the CPUs have lost local
references to it by going through a quiescent state (like a context
switch). Because the write-side is significantly expensive
compared to the read side, Read-Copy Update
mechanism is more suited for scenerios where the data to be
protected is read more often than written. For uniprocessor systems,
Read-Copy Update mechanism eliminates the need to mask interrupts
for mutual exclusion. Read-Copy Update mechanism is thus
suitable for mutual exclusion in network routing tables, device state
tables, deferred deletion of data structures, multi-path I/O
device maintenance etc.
Read-Copy Update was originally designed for DYNIX/ptx, a UNIX
operating system from Sequent Computer Systems Inc., now a part
of IBM. Similar methods were also used for
Tornado and K42
OS projects at University of Toronto and IBM Research.
Documentation:
RCU-HOWTO in text and
html.
Read-Copy Update Detailed Documentation
Original paper on Read-Copy Update mechanism
Read-Copy Update Paper at Ottawa Linux Symposium, July 2001
Read-Copy Update Interfaces for Linux kernel:
Recent Read-Copy Update patches for 2.4 and 2.5 kernel provide
the following interfaces -
/*
* Structure to queue requests for invoking a callback at after
* all CPUs have gone through a "quiescent" state
*/
struct rcu_head {
struct rcu_head *next;
void (*func)(void *obj);
void *obj;
};
/* Invoke a callback after all CPUs go through a quiescent state */
void call_rcu(struct rcu_head *head, void (*func)(void *arg), void *arg);
/* Wait until all the CPUs go through a quiescent state */
void synchronize_kernel(void);
RCU Implementations:
Currently RCU has been implemented in several ways as mentioned
in the OLS paper and some more. All implementation provide
similar interface - a callback head structure rcu_head
that is to be called after all the CPUs have lost any pointers
to kernel data that they may have held when they went through atleast
one "quiescent" state like context switch, idle loop, user mode
code etc. The implementations differ in how they detect that
all the CPUs have gone through quiescent state(s).
Quiescent state counter based RCU:
A quiescent state is when a CPU loses its references to kernel
data structures and thus cannot interfere with manipulation of
in-kernel data or be interfered with by manipulation of in-kernel
data. If a given process removes all globall-accessible references
to a particular data structure and waits till all the CPUs pass
through a quiescent state, it can then manipulate that data
structure without any concern of interference from other CPUs.
In this approach of RCU, completion of a quiescent cycle where all CPUs
go through a quiescent state is detected by counting and monitoring
the quiescent states for all CPUs. Monitoring of the quiescent
state counters can be done in various different ways. We have
experimented with the following algorithms -
- Maintain a per-cpu counter that is incremented
whenever there is a context switch or local timer interrupt detects
an idle loop or user code. The callbacks are maintained in per-cpu
lists and a global batch numbering system is used to assign batch
numbers to different per-cpu callback batches. A global current
batch number indicates current batch of callbacks waiting
for a quiescent cycle to complete. Once one batch completes,
callbacks are moved to a "done" list and the next batch starts.
The done callbacks are invoked from a bottom-half context.
This approach has the advantage that there is very little overhead
in the fast path (schedule()), just a per-cpu counter increment.
Earlier implementations of this used per-cpu local timer to
check for callbacks which is in arch-dependent trees in linux. The latest
implementation however avoids that and has no code in arch-dependent
source trees.
- Use a per-cpu counter to count the context switches, but
instead of instead of waiting for the counter to be incremented
in each CPU, it forces a reschedule in all CPUs. This variant
called rcu_poll was contributed by Andrea Arcangeli and is a part of
aa series of kernels.
These implementations can be downloaded from
Read-Copy
Update Package in LSE.
Scheduler based RCU:
In this approach by Rusty Russell, the completion of RCU cycle
is detected using a set of per-CPU counters that are updated and checked
during context switches. The counter scheme allows detection of when
all CPUs have done atleast one context switch and the callbacks
are then invoked from scheduler context. One main advantage of this
scheme is that it allows sleeping in the callback.
Rusty is maintaining this patch
here.
Read-Copy Update Uses in Linux Kernel:
Scaling the dentry cache:
Using RCU it is possible to provide some relief from acquiring
dcache_lock. The main work of dentry cache is done in looking up
existing dentries in the cache which is done by d_lookup routine.
Using RCU (and lazy-lru algorithm), we could do lock-less lookup
for dentries and bring down the contention for dcache_lock while
running dbench considerably from 16.5% to 0.95% on an 8-way SMP box.
Recent measurements by Anton showed > 20% improvement in dbench
throughput for a 32-way PPC machine. It also shows considerably
less contention for other workloads like httperf. Details
of this work can be found
here.
These implementations can be downloaded from
Read-Copy
Update Package in LSE
Hot Plug CPU Support:
Linux Hotplug CPU Support
uses Read-Copy Update to support addition and removal of CPUs
safely.
Module Unloading and Cleanup:
The deferred update mechanism provided by Read-Copy Update can be useful in
providing a two phase cleanup logic to solve the race conditions involved in
module unloading. The following examples illustrate how RCU can be
exploited here -
module_unloading-2.4.2-0.1.patch - Module unloading using
Read-Copy Update for 2.4.2 kernel (Version 01) provides a new cleanup logic.
Readme explaining module unloading using Read-Copy Update.
Test cases for module unloading using Read-Copy Update.
A new framework for module unloading using RCU is being developed by
Rusty Russell.
Scalable FD Management:
The per-task files_struct can be shared between tasks
created using clone() with CLONE_FILES flag set. Due to bouncing
of the cache line containing the rwlock in files_struct,
such workloads can get slowed significantly, specially
in higher-end SMP and NUMA machines. RCU provides a
way to do lock-less lookup of file descriptors thereby
avoiding this problem completely. Our experiments show
that on a benchmark like "chat", the performance
improvement can be as high as 30% even on a 4-way system.
The details of the problem and the proposed solution with improved
results can be found here.
The patch is
files_struct_rcu-2.4.13pre5-06.patch.
IPV4 Route Cache Lookup:
The route cache is a hash table that caches routes and is
mostly read. Update to it is done only periodically when entries
are aged off or the cache is flushed. For the most part however it
is just looked up. The route cache entries are protected by a per-bucket
reader-writer lock. Our work here uses RCU to do lock-less
lookup of routes from route cache.
The patches for lock-less route cache lookup can be downloaded
from the
Read-Copy Update Package in LSE download site.
Read-Copy Update Previous Versions:
The earlier patches for Read-Copy Update can be downloaded
from the
Read-Copy Update Package in LSE download site.
RCU based on original DYNIX/ptx code (released by IBM under GPL):
rclock-2.4.1-01.patch
rclock-2.4.2-01.patch