Scalable Directory Entry Cache

Dipankar Sarma <>
Maneesh Soni <>
Hanna Linder <>
Troy Wilson <>

1.0 Introduction

The directory entry cache is used to facilitate quick lookup of inodes from file paths. The basic lookup is done through a hash table, but the entries are also linked together in several other lists providing various information like hierarchy, alias and aging.

Currently dentry cache is protected by a single global lock, dcache_lock. This lock is held during all operations on the dentry hash table. It is also used to protect all other lists (d_child, d_alias, d_lru etc.) that connects the directory entries. It doesn't seem to be an issue with lower-end SMP machines, but as the number of CPUs increase, filesystem intensive benchmarks show sign that dcache_lock is contentious . We measured the scalability of directory entry cache using dbench, httperf and specweb99 benchmarks and experimented with various ways to enhance its performance.

2.0 A look at the issues

So far we have used 3 benchmarks for understanding scaling of dentry cache - SPECWeb99, dbench (with settings to avoid I/O as much as possible) and httperf.


SPEC(tm) and the benchmark name SPECweb(tm) are registered trademarks of the Standard Performance Evaluation Corporation. The benchmarking were done for research purpose only and were non-compliant with the following devaitions from the rules -

  1. It was run on hardware that does not meet the SPEC availability-to-the-public criteria. The machine was an engineering sample.
  2. access_log wasn't kept for full accounting. It was being written, but deleted every 200 seconds.
For the latest SPECweb99 benchmark results visit

2.1 SPECWeb99 Results

The SPECWeb99 measurements were done on a 8-CPU PIII Xeon.

Kernel : 2.4.17 + lse02E
Throughput : 2258 simultaneous connections

Lockmeter statistics for this benchmarks reveals -
SPINLOCKS         HOLD            WAIT
 15.7% 20.8%  2.1us(6668us)   23us(  14ms)( 4.4%)   5215460 79.2% 20.8%    0%  dcache_lock
 0.19% 19.4%  1.7us(1327us)   66us(  13ms)(0.17%)     75355 80.6% 19.4%    0%    d_alloc+0x130
 0.12% 20.8%  1.1us(1225us)   16us(2907us)(0.05%)     75353 79.2% 20.8%    0%    d_instantiate+0x1c
 13.5% 20.9%  2.1us(6668us)   20us(  13ms)( 3.3%)   4433721 79.1% 20.9%    0%    d_lookup+0x58
 0.16% 14.1%  1.5us(1065us)   17us(1930us)(0.03%)     75353 85.9% 14.1%    0%    d_rehash+0x3c
  1.7% 21.0%  2.1us(1119us)   40us(  14ms)(0.84%)    555661 79.0% 21.0%    0%    dput+0x30
 0.00%  5.9%  2.2us( 5.8us)  5.5us( 5.5us)(0.00%)        17 94.1%  5.9%    0%    link_path_walk+0x28c
Kernel profiling output shows that directory entry cache operations are taking significant amount of time -
   176 dput                                       0.5432
    28 d_invalidate                               0.2000
    46 d_alloc                                    0.1223
    12 d_instantiate                              0.1765
  1293 d_lookup                                   4.4896
    19 d_rehash                                   0.2159
This clearly indicates that the global locking scheme for the directory entry cache is a problem and needs to be solved immediately.

2.2 Dbench Results

The dbench results shows that lock utilization and contention, while not significant, grows steadily with increasing number of CPUs. 8-CPU SMP results show 5.3% utilization with 16.5% contention on this lock.

The latest numbers are based on 2.4.16 kernel. The numbers were measured in a 8-CPU P-III xeon (1MB L2 cache) box with 2 GB RAM.

One significant observation with the lockmeter output is that for this workload d_lookup() is the most commonly done operation.

SPINLOCKS         HOLD            WAIT
  5.3% 16.5%  0.6us(2787us)  5.0us(3094us)(0.89%)  15069563 83.5% 16.5%    0%  dcache_lock
 0.01% 10.9%  0.2us( 7.5us)  5.3us( 116us)(0.00%)    119448 89.1% 10.9%    0%    d_alloc+0x128
 0.04% 14.2%  0.3us(  42us)  6.3us( 925us)(0.02%)    233290 85.8% 14.2%    0%    d_delete+0x10
 0.00%  3.5%  0.2us( 3.1us)  5.6us(  41us)(0.00%)      5050 96.5%  3.5%    0%    d_delete+0x94
 0.04% 10.9%  0.2us( 8.2us)  5.3us(1269us)(0.01%)    352739 89.1% 10.9%    0%    d_instantiate+0x1c
  4.8% 17.2%  0.7us(1362us)  4.8us(2692us)(0.76%)  12725262 82.8% 17.2%    0%    d_lookup+0x5c
 0.02% 11.0%  0.9us(  22us)  5.4us(1310us)(0.00%)     46800 89.0% 11.0%    0%    d_move+0x38
 0.01%  5.1%  0.2us(  37us)  4.2us(  84us)(0.00%)    119438 94.9%  5.1%    0%    d_rehash+0x40
 0.00%  2.5%  0.2us( 3.1us)  5.6us(  45us)(0.00%)      1680 97.5%  2.5%    0%    d_unhash+0x34
 0.31% 15.0%  0.4us(  64us)  6.2us(3094us)(0.09%)   1384623 85.0% 15.0%    0%    dput+0x30
 0.00% 0.82%  0.4us( 4.2us)  6.4us( 6.4us)(0.00%)       122 99.2% 0.82%    0%    link_path_walk+0x2a8
 0.00%    0%  1.7us( 1.8us)    0us                        2  100%    0%    0%    link_path_walk+0x618
 0.00%  6.4%  1.9us( 832us)  5.0us(  49us)(0.00%)      3630 93.6%  6.4%    0%    prune_dcache+0x14
 0.04%  9.4%  1.0us(1382us)  4.7us( 148us)(0.00%)     70974 90.6%  9.4%    0%    prune_dcache+0x138
 0.04%  4.2%   11us(2787us)  3.8us(  24us)(0.00%)      6505 95.8%  4.2%    0%    select_parent+0x20

This snippet of lockmeter output for 8-CPU SMP shows that 84% of the time dcache_lock was acquired by d_lookup().

2.3 Httperf Results

It shows a moderate utilization of 6.2% with 4.3% contention in an 8 CPU environment.

The latest numbers are based on 2.4.16 kernel. The numbers were measured in a 8-CPU P-III xeon (1MB L2 cache) box with 2 GB RAM.

A snippet of lockmeter out put showing the distribution of acquisition of dcache_lock is as following -

SPINLOCKS         HOLD            WAIT
  6.2%  4.3%  0.8us( 390us)  2.7us( 579us)(0.12%)  20243025 95.7%  4.3%    0%  dcache_lock
 0.02%  6.5%  0.5us(  45us)  2.7us( 281us)(0.00%)    100031 93.5%  6.5%    0%    d_alloc+0x128
 0.01%  4.9%  0.2us( 4.6us)  2.9us(  58us)(0.00%)    100032 95.1%  4.9%    0%    d_instantiate+0x1c
  5.0%  4.5%  0.8us( 387us)  2.8us( 579us)(0.09%)  15009129 95.5%  4.5%    0%    d_lookup+0x5c
 0.02%  5.8%  0.6us(  34us)  3.1us(  45us)(0.00%)    100031 94.2%  5.8%    0%    d_rehash+0x40
 0.19%  8.8%  0.5us( 296us)  2.8us( 315us)(0.01%)    933218 91.2%  8.8%    0%    dput+0x30
 0.89%  2.3%  0.6us( 390us)  2.5us( 309us)(0.01%)   4000584 97.7%  2.3%    0%    link_path_walk+0x2a8

This shows that 74% of the time the global lock is acquired from d_lookup().

2.4 Summary of baseline measurements

The baseline measurements show that dcache_lock while not very hot shows increasing level of contention for some benchmarks. It is probably not an immediate bottleneck only due to other bottlenecks like BKL. While looking at the distribution of lock aqcuisitions for these workloads, it becomes obvious that d_lookup() is the routine to optimize since it is the routine from where the global lock is acquired most.

3.0 Solutions for directory entry cache scaling

In our effort to improve dcache scaling, we worked on a set of different approaches. They ranged from using finer grain locking to lockfree dcache lookups. We describe these approaches in this section.

3.1 Finer grain locking for dcache

A common locking scheme in hash tables is to use per-bucket locks. We tried to use this approach with the dentry cache, but it resulted in lots of nasty and hard to debug race conditions and it never worked properly. The fundamental problem is that dentry cache has several other lists that span across buckets and to modify/access any of those list we need to take multiple bucket locks. This results in a serious lock ordering problem and it turned out to be unusable. Several variations were tried including one where per-bucket lock protected the hash chain and the global dcache_lock protected the rest. All these logics resulted in very complicated code and they never worked satisfactorily. Therefore this approach was abandoned.

3.2 Big-reader dcache_lock

As the base line lockmeter data shows, a large part of the dcache operations involve looking up the hash table. Based on our earlier presentaion of this data in linux-kernel mailing list, Mark Hahn had suggested that we try big-reader lock. There is still a significant number of updates to directory entries, so on the surface big-reader lock didn't seem suitable. Nevertheless, in order to complete our search for solutions, we implemented a big-reader lock based directory cache. The implementation is very simple, the dcache_lock is just replaced by BRLOCK_DCACHE. Places like d_lookup where we just read from the dentry cache, we use br_read_lock(), elsewhere br_write_lock() is used. The following situations were handled differently -

  1. LRU update in d_lookup() - To update the LRU list of directory entries after looking it up, we need to hold the br_write_lock(). Since lookups are done so often, we use a lazy update of LRU list to avoid doing a br_write_lock() in the fast path. This is similar to the lazy LRU approach used with RCU based optimization described later.
  2. dput() uses atomic_dec_and_lock() reduce contention on dcache_lock. This isn't possible with brlock, so we unconditionally do a br_write_lock().

3.3 Fast walk dentry cache

3.4 RCU based lockfree lookup

dcache_lock is held in d_lookup() while traversing the d_hash list and to update the LRU list for freeing if the dentry found has zero ref count. By using RCU we can avoid dcache_lock while reading d_hash list. That is what the first RCU dcache patch dcache_rcu-2.4.10-01 did. In this, we were doing d_hash lookup lock-free but had to acquire dcache_lock while updating the LRU list. This patch does provide some decrease in lock holdtime and contention level to some extent. Lockmeter statistics for base kernel (2.4.16) while running dbench are here and for this minimal patch are here.

This proved that just doing the lock-free lookup of the dentry cache is not enough. d_lookup() also acquires dcache_lock to update the LRU list if the newly looked up entry previously had zero reference count. For dbench workload, most often, this was true and hence we ended up acquiring the lock after almost every lock-free lookup of hash table in d_lookup(). We somehow needed to avoid acquiring dcache_lock so often. Different alogorithms were tried to get rid of this lock from d_lookup(). Here is the summary of the results.

3.4.1 Seperate lock for LRU list

Though this logic worked but the entire locking overhead got transferred to the LRU list lock and practically there was no gain at all. The idea behind this was that as d_lookup() only updates the LRU list we can ease dcache_lock by introducing a seperate lock for LRU list. But the routines like prune_dcache, select_parent, d_prune_aliases etc read/write other lists also apart from LRU at these places both locks were needed. dcache_rcu-lru_lock-2.4.16-02.patch. This patch is not tested completely except running dbench and collecting lockmeter statistics. (lockmeter statistics).

3.4.2 Per bucket lock for hash chain & LRU list

This approach was confronted by lots of nasty and hard-to-debug race conditions and it never worked properly. The fundamental problem was that dentry cache has several other lists that span across multiple buckets and to modify/access any of those list we need to take multiple bucket locks. This results in a serious lock ordering problem and it turned out to be too difficult to solve without completely redesigning the path name lookup subsytem. Another problem in this case was that both the per-bucket lock and dcache_lock were needed at many places. The entire logic became quite complicated and we could not get any statistics out of this as it never worked except booting. The approach was abandoned. dcache_rcu-bucket-2.4.16-05.patch.

3.4.3 Lazy LRU dentry cache

One fundamental realization that comes out of these experiments is that it is better to keep the dcache_lock around and go for optimizations around it. This keeps things simple.

Given that lock-free traversal of hash chains is not enough for reducing dcache_lock acquisitions, we looked at the possibility of removing dcache_lock acquisitions completely from d_lookup(). It seemed that after using RCU based lock-free hash lookup, only reason dcache_lock was needed in d_lookup() was for updating the LRU list if the entry previously had zero refcount. We decided to the relax the LRU list norms by letting entries with non-zero refcount be in it and delay the updation of the LRU list. The logic behind this was that instead of acquiring the global dcache_lock for every single entry, we would process a number of entries per acquisition of the lock. To accomplish this, we introduced another flag in dentry to mark it and a per-entry lock to maintain consistency between the flag and the refcount. For all other fields, the refcount still provided the mutual exclusion. We also needed to make some decisions to ensure that we don't have too many entries in the LRU list with non-zero refcount so that the deletion process becomes very lengthy. For this we decided to try out two approaches - periodic updation of LRU list using a timer and periodic updation of LRU list during various traversals. Both these appraches use the same tactics in conjunction with RCU, they only differ in how the periodic updation is done. Notes on lazy dcache implementation

  1. Contention for dcache_lock is reduced in all routines at the expense of prune_dcache(). prune_dache() and select_parent() takes more time as d_lru list is longer. Pruning doesn't happen frequently, but if it does, d_lru list will not be that long anyway and hence dcache_lock round trip for pruning will be less. We will investigate pruning later.
  2. A per-dentry lock(d_lock) is used to protect the d_vfs_flags and d_count in d_lookup(). Right now apart from d_lookup, per-dentry lock (dentry->d_lock) is used whereever we are reading or modifying d_count or d_vfs_flags. It should be possible to tune this code better and relax this locking scheme. We will investigate this later.
  3. Due to the locking hierarchy of dcache_lock and the per-dentry lock, dput() code has become slightly ugly - it needs to check the reference count again after acquiring the per-dentry lock.
  4. While doing lock-free lookup using RCU, another CPU can delete an entry from the hash list. In the dentry cache, list_del_init() init is used for this and this may result in lookups not seeing the rest of the hash list. To avoid this, we save the next entry pointer in d_nexthash and mark the dentry DCACHE_UNLINKED before deleting and initializing the dentry. For looking up, this flag is checked to decide which next field is to be used for hash list traversal. This whole approach can be avoided (and simplified) by doing a list_del() instead, but that requires changing the code at many places that checks for unhashed directory entry. This is currently under development.
  5. Proper memory barriers must be in place for weakly ordered CPUs to work correctly with lockfree lookup. There two places where these are important - while deleting an entry, marking of the entry as deleted (DCACHE_UNLINKED) must be written out first before the write to the next pointer goes out. On the read side, we need to use a read_barrier_depends() for every iteration of the walking of the hash list.
  6. The d_lru list is made uptodate through select_parent(), prune_dcache(), and dput(). The dentries with non-zero reference count are removed from d_lru_list in those routines.
  7. Now every dget() marks the dentry as referenced by setting DCACHE_REFERENCE bit in d_vfs_flags. So we have to always take the per dentry lock in dget() and there is no dget_locked(). Lazy LRU dcache patches

For lazy LRU updation we also tried a timer based approach. In this, a timer was used to remove the referenced dentries from LRU list so that LRU list can be kept short. But this does not prove any better than no-timer approach. Also as we have to take dcache_lock from the timer handler we had to use spin_lock_bh() nad spin_unlock_bh() for dcache_lock. This also created the prbolem of cyclic dependencies in dcache.h. But the patch is worth looking as probably proper tuning of timer frequency can give better results. dcache_rcu-lazy_lru-timer-2.4.16-04.patch

The dcache_rcu-lazy_lru patch has been renamed as dcache_rcu since that is the final algorithm that we chose. dcache_rcu patches are available for download from Read-Copy Update package in LSE download section.

4. Results of directory entry improvement work

4.1 Big-reader lock results

4.2 Fastwalk results

4.3 Lazy dcache results

We ran dbench and httperf for understanding the effect of lazy dcache and results were very good. By doing a lock-free d_lookup(), we were able to substantially cut down the number of dcache_lock acquisitions. This resulted in substantially decreased contention as well as lock utilizations.

4.3.1 dbench results

dbench results showed that lock utilization and contention levels remain flat with lazy dcache as opposed to steadily increasing with the baseline kernel. So for 8 processors, contention level is 0.95% as opposed to 16.5% for the baseline (2.4.16) kernel.

SPINLOCKS         HOLD            WAIT
 0.89% 0.95%  0.6us(6516us)   19us(6411us)(0.03%)   2330127 99.0% 0.95%    0%  dcache_lock
 0.02%  1.7%  0.2us(  20us)   17us(2019us)(0.00%)    116150 98.3%  1.7%    0%    d_alloc+0x144
 0.03% 0.42%  0.2us(  49us)   35us(6033us)(0.00%)    233290 99.6% 0.42%    0%    d_delete+0x10
 0.00% 0.14%  0.8us(  12us)  3.4us( 8.5us)(0.00%)      5050 99.9% 0.14%    0%    d_delete+0x98
 0.03% 0.40%  0.1us(  32us)   34us(5251us)(0.00%)    349441 99.6% 0.40%    0%    d_instantiate+0x1c
 0.05% 0.30%  1.7us(  44us)   22us(1770us)(0.00%)     46800 99.7% 0.30%    0%    d_move+0x38
 0.01% 0.16%  0.1us(  21us)  4.5us( 334us)(0.00%)    116140 99.8% 0.16%    0%    d_rehash+0x40
 0.00% 0.65%  0.7us( 3.7us)  8.4us(  57us)(0.00%)      1680 99.3% 0.65%    0%    d_vfs_unhash+0x44
 0.56%  1.1%  0.7us(  84us)   18us(6411us)(0.02%)   1383859 98.9%  1.1%    0%    dput+0x30
 0.00% 0.88%  0.4us( 2.3us)  1.3us( 1.3us)(0.00%)       114 99.1% 0.88%    0%    link_path_walk+0x2d8
 0.00%    0%  0.9us( 1.6us)    0us                        2  100%    0%    0%    link_path_walk+0x678
 0.01%  4.4%  4.3us(6516us)  4.8us(  32us)(0.00%)      3566 95.6%  4.4%    0%    prune_dcache+0x14
 0.07%  2.3%  1.8us(6289us)  4.4us( 718us)(0.00%)     67591 97.7%  2.3%    0%    prune_dcache+0x150
 0.11% 0.79%   29us(4992us)   28us(1116us)(0.00%)      6444 99.2% 0.79%    0%    select_parent+0x24

One significant observation is that maximum lock hold time for prune_dcache() and select_parent() are high for this algorithm. However, these are not frequently done operation for this workload.

A comparison of baseline (2.4.16) kernel and lazy dcache can be seen in the following graphs -

The throughput results show marginal differences (statistically insignificant) for upto 4 CPUs, but of about 1% (statistically significant) on 8 CPUs. So, there is no performance regression in the lower end and the gains are really small in the higher end.

The detailed lockmeter outputs are - 1 CPU, 2 CPUs, 4 CPUs, 6 CPUs and 8 CPUs.

4.3.2 httperf results

The httperf results showed similar decrease in lock contention and lock utilization. With 8 CPUs, it showed significantly less contention -

SPINLOCKS         HOLD            WAIT
  1.4% 0.92%  0.7us( 577us)  2.2us( 617us)(0.00%)   4821866 99.1% 0.92%    0%  dcache_lock
 0.02%  2.2%  0.6us(  30us)  1.9us( 7.8us)(0.00%)    100031 97.8%  2.2%    0%    d_alloc+0x144
 0.01%  1.7%  0.2us(  12us)  2.2us( 9.2us)(0.00%)    100032 98.3%  1.7%    0%    d_instantiate+0x1c
 0.03%  1.5%  0.7us( 9.2us)  2.3us(  10us)(0.00%)    100031 98.5%  1.5%    0%    d_rehash+0x40
 0.24%  2.1%  1.2us( 577us)  1.9us( 283us)(0.00%)    521329 97.9%  2.1%    0%    dput+0x30
  1.1% 0.70%  0.7us( 366us)  2.4us( 617us)(0.00%)   4000443 99.3% 0.70%    0%    link_path_walk+0x2d8

A comparison of lockmeter statistics for httperf can be seen in the following graphs -

The detailed lockmeter outputs are - 1 CPU, 2 CPUs, 4 CPUs, 6 CPUs and 8 CPUs.

The results of httperf (replies/sec for fixed connection rate) showed statisticially insigficant difference between base 2.4.16 and lazy dcache kernels.

4.4 Comparison of SPECWeb99 results

Troy Wilson did a comparative study of some of the dcache patches using specweb99. His measurement identifies what is good for dentry cache to use. Here is a summary of his reasults. The measurements were done on a 8-CPU PIII Xeon server with more RAM than you can dream of getting in your desktop.

4.4.1 Throughput comparison

kernel throughput (simultaneous connections) % improvement
2.4.17+lse02E 2258 -
2.4.17+lse02E+fastwalk 2280 1%
2.4.17+lse02E+dcache_rcu 2530 12%

4.4.2 Lockmeter comparison


SPINLOCKS         HOLD            WAIT
 15.7% 20.8%  2.1us(6668us)   23us(  14ms)( 4.4%)   5215460 79.2% 20.8% 0%  dcache_lock


SPINLOCKS         HOLD            WAIT
 17.2% 17.7%  7.0us(  13ms)   53us(  30ms)( 2.9%)   1608566 82.3% 17.7% 0%  dcache_lock


SPINLOCKS         HOLD            WAIT
  1.9%  2.3%  2.0us(3343us)   71us(9406us)(0.20%)    657152 97.7%  2.3% 0%  dcache_lock

4.4.3 Conclusions

Fastwalk clearly helps by reducing the number of dcache_lock acquisitions by 69.9%. However, holding the lock over entire walk of the path is clearly detrimental to performance as seen by the 3-fold increase in average hold time and average wait time. dcache_rcu has the biggest positive impact on performance for webserver type of workload. Its gains come from two improvements - 87.3% reduction in lock acquisitions at the same time keeping the lock hold time constant.

4.5 2.5 Measurements

The dcache_rcu patch for 2.5 essentially does the same thing as in 2.4 that is faster d_lookup() by not taking the global dcache_lock and significantly brings down the contention for dcache_lock. In 2.5 version while pathwalk'ing the dcache_lock is not held and dentries are kept around by incrementing the ref count. This also helps in reducing the hold time for dcache_lock by more than 50%. The dcache-rcu patch for 2.5 kernel can be downlaod from Read-Copy Update package in LSE download section.

4.5.1 dbench numbers

Following is the throughput (MB/sec) comparison while running dbench for 12 clients between base (2.5.30) kernel and dcache_rcu-12.

# CPUS2.5.30dcache_rcu-12
Uni 82.405 82.65241
2 141.914 141.188
4 226.664 227.287
8 256.064 260.4548

Following is the throughput (MB/sec) comparison while running dbench for varying number of clients on an 8-cpu PIII Xeon with 16G RAM.

# clients2.5.30dcache_rcu-12
178.76 81.43
8241.64 246.8

4.5.2 Lockmeter comparison


SPINLOCKS         HOLD            WAIT
  3.3%  7.8%  2.1us(2437us)  4.8us(2415us)(0.02%)    309088 92.2%  7.8%    0%  dcache_lock


SPINLOCKS         HOLD            WAIT
  0.60%  1.1%  1.0us(1108us)  6.0us(1260us)(0.00%)    112929 98.9%  1.1%    0%  dcache_lock

The Detailed lockmeter reports are available at
Base (2.5.30)
Base (2.5.30) + dcache_rcu-12-2.5.30.patch


  1. dcache_rcu code cleanup and further tuning to improve performance.
  2. Simplify things in dcache_rcu using RCU list macros.
  3. Add lazy_lru to fastwalk and see if makes any difference
  4. Check if alternative locking (like brlock) makes any difference

SourceForge Logo