Read Copy Update HOWTO

Dipankar Sarma <dipankar@in.ibm.com>
Kristin Thomas <kristint@us.ibm.com>

2001-09-03

Revision History
Revision 0.1    2001-11-12    Revised by: KET

This HOWTO is a concise guide to the Read-Copy Update mutual exclusion
mechanism in the Linux kernel.

---------------------------------------------------------------------------

Copyright and Legal Notice

Copyright © 2001 IBM Corporation. All rights reserved.

This document may be reproduced or distributed in any form without prior
permission, provided the copyright notice is retained on all copies.
Modified versions of this document may be freely distributed, provided that
they are clearly identified as such and this copyright notice is included
intact.

This document is provided "AS IS," with no express or implied warranties.
Use the information in this document at your own risk.

Special Notices

Linux is a trademark of Linus Torvalds. Other company, product, and service
names may be trademarks or service marks of others.

---------------------------------------------------------------------------

Introduction

This document describes how to use the Read-Copy Update (RCU) mechanism in
the Linux kernel for scalable mutual exclusion. The HOWTO explains the
interfaces associated with RCU, provides examples, and shows how RCU can be
used to save CPU cycles.

---------------------------------------------------------------------------

What is Read-Copy Update?

Read-Copy Update (RCU) is a mutual exclusion primitive that takes advantage
of the event-driven nature of operating systems and provides a way to read
data structures shared between CPUs at essentially zero cost. For
read-mostly data structures, it reduces contention on conventional locks and
lock cache line bouncing by allowing lock-free reading of data.

RCU allows you to read a shared data structure as if no other CPU were
accessing it. When you need to update the data structure, you update the
global pointer to point to a new copy and keep the old copy around until all
the threads currently executing inside the kernel have completed. Because
any thread that starts after the pointer is updated can see only the new
copy, once those pre-existing threads have completed no CPU can hold a
reference to the old copy, and it can safely be deleted.

---------------------------------------------------------------------------

Why use Read-Copy Update?

The following list explains why you should use the RCU mechanism. It also
explains the motivation for RCU's development.

Increase in cost of conventional locks
    CPU speeds have been increasing by a factor of 1.5 to 2 per year for
    more than a decade. However, interconnect speeds (printed circuit boards
    and buses) have only been increasing by roughly 15% per year over the
    same period. This growing gap makes communication over the interconnect
    increasingly expensive relative to raw CPU speed, and it provides the
    motivation for constructing mutual-exclusion schemes that do not rely so
    heavily on communication over the relatively slow interconnects between
    the relatively fast CPUs. The RCU mechanism allows locking operations to
    be eliminated on read accesses to the data it protects.

Avoiding complicated races
    The RCU mechanism is not as susceptible to deadlock as explicit lock
    operations are. This reduced susceptibility to deadlock reduces or
    eliminates the code that would otherwise be required to detect and
    repair, or to avoid, deadlocks, which in turn reduces software
    development and maintenance costs.

Cache benefits of lock-free reads
    By allowing lock-free reading of data, RCU avoids thrashing of shared
    lock cache lines.
    In read-mostly situations, lock-free reading provides significant
    benefits. One such example is the synthetic "chat" benchmark, in which a
    number of cloned tasks share a files_struct. Even though a reader-writer
    lock is used to read the file pointer in fget(), the bouncing of the
    lock cache line severely impacts performance when a large number of CPUs
    are used. Using RCU, fget() can look up the file pointer without
    acquiring any lock, which significantly improves performance.

---------------------------------------------------------------------------

Additional Resources

Additional information about RCU can be found on the RCU homepage at
http://lse.sourceforge.net/locking/rcupdate.html. The white paper on RCU for
Linux was published at the Ottawa Linux Symposium, 2001. The paper is
available at
http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf.

---------------------------------------------------------------------------

Description of RCU Primitives

The RCU mechanism consists of an interface that registers a callback
function. This function is called after all the CPUs in the system have gone
through at least one "quiescent" state (such as a context switch, the idle
loop, or user code). This callback mechanism can be used while updating
shared data structures. However, more than one update cannot be performed at
the same time, so updates must be serialized using conventional mutual
exclusion mechanisms. Typical update-side code follows this process:

 1. Lock the update side.

 2. Update the global pointer to the data structure (which embeds a struct
    rcu_head) with the new copy or NULL. After this step, any new kernel
    thread on another CPU can see only the new pointer.

 3. Register a callback with call_rcu() to free the old copy after all CPUs
    have gone through a quiescent point. This step ensures that when the old
    copy is freed, no CPU can hold a reference to it.

 4. Unlock the update side.

The header file that declares these interfaces is linux/rcupdate.h.

---------------------------------------------------------------------------

struct rcu_head

This structure contains all the information necessary for the RCU mechanism
to maintain pending updates. Each update corresponds to one struct rcu_head.
Typically, the rcu_head structure is embedded in the data structure that
uses RCU for updating and lookup.

struct rcu_head {
        struct list_head list;
        void (*func)(void *obj);
        void *arg;
};

#define RCU_HEAD_INIT(head) \
        { LIST_HEAD_INIT(head.list), NULL, NULL }

#define RCU_HEAD(head) \
        struct rcu_head head = RCU_HEAD_INIT(head)

#define INIT_RCU_HEAD(ptr) do { \
        INIT_LIST_HEAD(&(ptr)->list); (ptr)->func = NULL; \
        (ptr)->arg = NULL; \
} while (0)

---------------------------------------------------------------------------

call_rcu()

void call_rcu(struct rcu_head *head, void (*func)(void *arg), void *arg);

This call registers the callback function func, which is invoked after all
the CPUs have gone through at least one quiescent state. For performance
reasons, good implementations of RCU do not wait for completion of the
quiescent cycle. Instead, they queue the callback in a batch and return. The
callback is invoked later, when all the CPUs have gone through at least one
quiescent state. The callback function may be invoked from bottom-half
context, so the user must take that into consideration.

---------------------------------------------------------------------------

synchronize_kernel

void synchronize_kernel(void);

This interface waits until all the CPUs of the system have gone through a
quiescent state, such as a context switch. The interface blocks, so it can
be called from process context only. Calling it from process context is
useful in situations where the update of RCU-protected data is not
performance critical. For example, module unloading is not performance
critical, so it can use this interface if the module's data structures are
protected by RCU.
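To tie the primitives above together, the following is a minimal update-side
sketch that follows the four-step process described at the beginning of this
section. It is not taken from the kernel: struct my_data, my_global_ptr,
my_lock, my_update(), and my_free() are hypothetical names used only for
illustration, and the code assumes the three-argument call_rcu() interface
shown above. Only call_rcu(), struct rcu_head, wmb(), kfree(), and the
spinlock primitives are real kernel interfaces.

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <asm/system.h>                         /* wmb() */

/* Hypothetical RCU-protected object; the rcu_head is embedded in it. */
struct my_data {
        int value;
        struct rcu_head rcu;
};

static struct my_data *my_global_ptr;           /* read lock-free by readers */
static spinlock_t my_lock = SPIN_LOCK_UNLOCKED; /* serializes updaters */

/* RCU callback: runs after every CPU has passed through a quiescent state. */
static void my_free(void *obj)
{
        kfree(obj);
}

/*
 * Replace the globally visible object with a new, fully initialized
 * copy and defer freeing the old one.
 */
void my_update(struct my_data *newp)
{
        struct my_data *old;

        spin_lock(&my_lock);            /* 1. lock the update side */
        old = my_global_ptr;
        wmb();                          /* make newp's fields visible first */
        my_global_ptr = newp;           /* 2. update the global pointer */
        if (old != NULL)                /* 3. free old copy after quiescence */
                call_rcu(&old->rcu, my_free, old);
        spin_unlock(&my_lock);          /* 4. unlock the update side */
}

A reader simply dereferences my_global_ptr and uses the object without
taking my_lock, subject to the read-side ordering considerations discussed
in the next section.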
---------------------------------------------------------------------------

Use of Memory Barriers

With conventional locks, the ordering of reads and writes in critical
sections is taken care of implicitly by the locking primitives. However,
with lock-free reading of data, as supported by RCU, the ordering of memory
accesses by the CPUs becomes important. The Linux kernel already provides a
set of memory barrier operations. The barrier interfaces most commonly used
with RCU are wmb() and rmb().

Note: In future versions of the kernel, the names of these interfaces might
be changed to write_barrier() and read_barrier().

For example, consider a situation where the algorithm cannot tolerate
reading a stale copy of a data structure. You can handle this situation with
the following approach:

Update side:

    Mark entry obsolete
    Remove it from the list
    Schedule an RCU callback to free it after the quiescent cycle

Read side:

    For each element
        if element is obsolete
            continue
        if element->value == key
            return element

If the write that marks the entry obsolete and the writes that remove it
from the list are re-ordered by either the compiler or the CPU, this code
will not work correctly. You can prevent the re-ordering by using a write
memory barrier during the update, as follows:

    Mark entry obsolete
    wmb(); /* write_barrier() later */
    Remove it from the list
    Schedule an RCU callback to free it after the quiescent cycle

However, on some CPUs, such as DEC Alpha, wmb() does not guarantee that the
writes are seen in order on the read side, even if there is a data
dependency (setting the element pointer in the loop header and using the
element pointer in the loop body). For these CPUs, there must be an explicit
rmb() on the read side to ensure that the writes are seen in order, as
follows:

    For each element
        rmb();
        if element is obsolete
            continue
        if element->value == key
            return element

Most CPUs ensure read-side ordering when there is a data dependency. To
avoid the unnecessary penalty on those CPUs, a new interface,
read_barrier_depends(), is under development. This interface performs an
rmb() on Alpha only.
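The pseudocode above could be written in C along the following lines. This
is only a sketch under assumptions not stated in the HOWTO: struct my_elem,
my_list, my_lookup(), my_remove(), and my_elem_free() are hypothetical
names, the list is a plain NULL-terminated singly linked list rather than a
kernel list_head, and update-side serialization (for example, a spinlock
held by the caller of my_remove()) is assumed rather than shown. rmb(),
wmb(), call_rcu(), and kfree() are the kernel interfaces discussed in this
HOWTO.

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <asm/system.h>         /* rmb(), wmb() */

/* Hypothetical list element protected by RCU. */
struct my_elem {
        struct my_elem *next;
        int obsolete;           /* set before the entry is unlinked */
        int value;
        struct rcu_head rcu;
};

static struct my_elem *my_list;         /* head of the lock-free list */

/*
 * Read side: lock-free walk that skips obsolete entries.  The rmb()
 * stands in for read_barrier_depends(); it is only strictly needed on
 * CPUs, such as Alpha, that do not order dependent reads.
 */
struct my_elem *my_lookup(int key)
{
        struct my_elem *p;

        for (p = my_list; p != NULL; p = p->next) {
                rmb();          /* read_barrier_depends() later */
                if (p->obsolete)
                        continue;
                if (p->value == key)
                        return p;
        }
        return NULL;
}

static void my_elem_free(void *obj)
{
        kfree(obj);
}

/*
 * Update side: mark the entry obsolete before unlinking it, with a wmb()
 * so the two writes cannot be re-ordered, then defer the free until all
 * CPUs have passed through a quiescent state.
 */
void my_remove(struct my_elem *prev, struct my_elem *p)
{
        p->obsolete = 1;
        wmb();                  /* write_barrier() later */
        prev->next = p->next;   /* remove it from the list */
        call_rcu(&p->rcu, my_elem_free, p);
}

Removing the head of the list is omitted for brevity; my_remove() assumes
the element has a predecessor prev.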
---------------------------------------------------------------------------

Applications of RCU

This section describes several example applications of RCU.

---------------------------------------------------------------------------

RCU for a Hash Table with refcnt

One common application of RCU is a list structure protected by a global
mutex whose individual elements are protected by a per-element mutex.
Usually, searches of the list are more common than insertions or deletions.
Without call_rcu, the code for simple search and delete functions on a
circular, doubly-linked list of elements would need many lock round-trips
and some complicated deadlock-avoidance code. The following code, which uses
call_rcu, is much shorter and simpler:

struct element {
        struct element *next;
        struct element *prev;
        atomic_t refcnt;
        int address;
        int data;
        struct rcu_head rcu;
};

/*
 * Find the element with the specified address, increment its
 * reference count, and return a pointer to it.  Return NULL if
 * there is no such address.
 */
struct element *search(int addr)
{
        struct element *p;
        struct element *p0;

        p0 = &head[hashit(addr)];
        rmb();  /* read_barrier_depends() */
        p = p0->next;
        while (p != p0) {
                rmb();  /* read_barrier_depends() */
                if (p->address == addr) {
                        atomic_inc(&p->refcnt);
                        return (p);
                }
                p = p->next;
        }
        return ((struct element *)NULL);
}

/*
 * Insert an element into the hash table.  Updates are serialized
 * by the global hash_lock.
 */
void insert(struct element *p)
{
        struct element *p0;

        p0 = &head[hashit(p->address)];
        spin_lock(&hash_lock);
        p->prev = p0;
        p->next = p0->next;
        /*
         * Make sure that the next and prev pointers of the inserted
         * element are visible to other CPUs before the actual
         * insertion.  This allows a lock-free lookup to traverse
         * the entire bucket list safely.
         */
        wmb();
        p0->next->prev = p;
        p0->next = p;
        spin_unlock(&hash_lock);
}

void myfree(void *obj)
{
        struct element *p = obj;

        /*
         * Now it is safe to check refcnt, since no one can be
         * holding a stale pointer to this element obtained during
         * a lock-free search().
         */
        if (atomic_read(&p->refcnt)) {
                /*
                 * Let some other garbage collection free it.
                 */
                return;
        }
        kfree(obj);
}

/*
 * Delete the specified element from the table.  The element
 * must be locked.
 */
void delete(struct element *p)
{
        /*
         * Get the global lock.  There is no possible deadlock,
         * since the global lock is now always acquired after the
         * per-element lock.
         */
        spin_lock(&hash_lock);

        /*
         * Delete the element from the list, release the lock,
         * schedule the deferred free, and return.
         */
        p->next->prev = p->prev;
        p->prev->next = p->next;
        spin_unlock(&hash_lock);
        call_rcu(&p->rcu, myfree, p);
        return;
}

---------------------------------------------------------------------------

RCU for use with kmem_cache_free()

Unlike [kv]free(), kmem_cache_free() requires an additional argument: a
pointer to the cache from which the memory being freed was allocated. This
extra pointer can be a problem for RCU, because the callback receives only a
single argument. You can avoid this problem by using a specific destroyer
function for each cache that requires deferred freeing. If the cache pointer
is a global variable, the destroyer function can use it directly and invoke
kmem_cache_free() from the callback. Otherwise, the cache pointer can be
embedded in the data structure along with the struct rcu_head, and the
destroyer function can retrieve it from there.

If the cache pointer is available as a global variable, the code is as
follows:

static kmem_cache_t *xxx_cache;

struct xxx_entry {
        struct xxx_entry *next;
        struct xxx_entry *prev;
        int flag;
        int info;
        struct rcu_head rch;
};

void xxx_destroy(void *ptr)
{
        struct xxx_entry *xp = (struct xxx_entry *)ptr;
        kmem_cache_free(xxx_cache, (void *)xp);
}

void xxx_delete(struct xxx_entry *xp)
{
        call_rcu(&xp->rch, xxx_destroy, xp);
}

If the cache pointer is not available as a global variable, it can be
embedded in the structure and accessed during destruction as follows:

struct xxx_entry {
        struct xxx_entry *next;
        struct xxx_entry *prev;
        int flag;
        int info;
        kmem_cache_t *cachep;
        struct rcu_head rch;
};

void xxx_destroy(void *ptr)
{
        struct xxx_entry *xp = (struct xxx_entry *)ptr;
        kmem_cache_free(xp->cachep, (void *)xp);
}

void xxx_delete(struct xxx_entry *xp)
{
        call_rcu(&xp->rch, xxx_destroy, xp);
}

---------------------------------------------------------------------------

To Do

---------------------------------------------------------------------------

Contributions