Scaling the dentry cache

Introduction:

Currently, the dentry cache is protected by a single global lock, dcache_lock. This lock is held during lookup of dentries as well as during all manipulations of the dentry cache and the assorted lists that maintain hierarchies, aliases and LRU entries. This doesn't seem to be an issue on lower-end SMP machines, but as the number of CPUs increases it becomes more of a problem. So we experimented with various approaches to understand the scaling issues related to the dentry cache.

A look at the issues:

So far we have used two benchmarks for understanding scaling of the dentry cache - dbench (with settings to avoid I/O as much as possible) and httperf. The latest numbers are based on the 2.4.16 kernel and were measured on an 8-way P-III Xeon (1MB L2 cache) box with 2 GB RAM.

dbench results:

The dbench results show that lock utilization and contention, while not significant, grow steadily with increasing numbers of CPUs. The 8-way results show 5.3% utilization with 16.5% contention on this lock.

One significant observation from the lockmeter output is that d_lookup() is the most frequently performed operation for this workload.


SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME
  5.3% 16.5%  0.6us(2787us)  5.0us(3094us)(0.89%)  15069563 83.5% 16.5%    0%  dcache_lock
 0.01% 10.9%  0.2us( 7.5us)  5.3us( 116us)(0.00%)    119448 89.1% 10.9%    0%    d_alloc+0x128
 0.04% 14.2%  0.3us(  42us)  6.3us( 925us)(0.02%)    233290 85.8% 14.2%    0%    d_delete+0x10
 0.00%  3.5%  0.2us( 3.1us)  5.6us(  41us)(0.00%)      5050 96.5%  3.5%    0%    d_delete+0x94
 0.04% 10.9%  0.2us( 8.2us)  5.3us(1269us)(0.01%)    352739 89.1% 10.9%    0%    d_instantiate+0x1c
  4.8% 17.2%  0.7us(1362us)  4.8us(2692us)(0.76%)  12725262 82.8% 17.2%    0%    d_lookup+0x5c
 0.02% 11.0%  0.9us(  22us)  5.4us(1310us)(0.00%)     46800 89.0% 11.0%    0%    d_move+0x38
 0.01%  5.1%  0.2us(  37us)  4.2us(  84us)(0.00%)    119438 94.9%  5.1%    0%    d_rehash+0x40
 0.00%  2.5%  0.2us( 3.1us)  5.6us(  45us)(0.00%)      1680 97.5%  2.5%    0%    d_unhash+0x34
 0.31% 15.0%  0.4us(  64us)  6.2us(3094us)(0.09%)   1384623 85.0% 15.0%    0%    dput+0x30
 0.00% 0.82%  0.4us( 4.2us)  6.4us( 6.4us)(0.00%)       122 99.2% 0.82%    0%    link_path_walk+0x2a8
 0.00%    0%  1.7us( 1.8us)    0us                        2  100%    0%    0%    link_path_walk+0x618
 0.00%  6.4%  1.9us( 832us)  5.0us(  49us)(0.00%)      3630 93.6%  6.4%    0%    prune_dcache+0x14
 0.04%  9.4%  1.0us(1382us)  4.7us( 148us)(0.00%)     70974 90.6%  9.4%    0%    prune_dcache+0x138
 0.04%  4.2%   11us(2787us)  3.8us(  24us)(0.00%)      6505 95.8%  4.2%    0%    select_parent+0x20

This snippet of lockmeter output for 8-way shows that 84% of the time dcache_lock was acquired by d_lookup().

httperf results:

The httperf results show a moderate utilization of 6.2% with 4.3% contention in an 8-CPU environment.

A snippet of the lockmeter output showing the distribution of dcache_lock acquisitions follows -

SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME
  6.2%  4.3%  0.8us( 390us)  2.7us( 579us)(0.12%)  20243025 95.7%  4.3%    0%  dcache_lock
 0.02%  6.5%  0.5us(  45us)  2.7us( 281us)(0.00%)    100031 93.5%  6.5%    0%    d_alloc+0x128
 0.01%  4.9%  0.2us( 4.6us)  2.9us(  58us)(0.00%)    100032 95.1%  4.9%    0%    d_instantiate+0x1c
  5.0%  4.5%  0.8us( 387us)  2.8us( 579us)(0.09%)  15009129 95.5%  4.5%    0%    d_lookup+0x5c
 0.02%  5.8%  0.6us(  34us)  3.1us(  45us)(0.00%)    100031 94.2%  5.8%    0%    d_rehash+0x40
 0.19%  8.8%  0.5us( 296us)  2.8us( 315us)(0.01%)    933218 91.2%  8.8%    0%    dput+0x30
 0.89%  2.3%  0.6us( 390us)  2.5us( 309us)(0.01%)   4000584 97.7%  2.3%    0%    link_path_walk+0x2a8

This shows that 74% of the time the global lock is acquired from d_lookup().

Summary of baseline measurements:

The baseline measurements show that dcache_lock, while not very hot, exhibits increasing levels of contention for some benchmarks. It is probably not an immediate bottleneck only because of other bottlenecks like the BKL. Looking at the distribution of lock acquisitions for these workloads, it becomes obvious that d_lookup() is the routine to optimize, since it is the routine from which the global lock is acquired most often.

Avoiding global lock in d_lookup()

dcache_lock is held in d_lookup() while traversing the d_hash list, and also to update the LRU list when the dentry found previously had a zero refcount. By using RCU we can avoid dcache_lock while reading the d_hash list, and that is what the first RCU dcache patch, dcache_rcu-2.4.10-01, did. In it, the d_hash lookup was done lock-free, but dcache_lock still had to be taken to update the LRU list. This patch does provide some decrease in lock hold time and contention. Lockmeter statistics for the base kernel (2.4.16) while running dbench are here and those for this minimal patch are here.
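A minimal sketch of what the lock-free walk looks like, assuming the rcu_read_lock()/rcu_read_unlock() primitives supplied by the RCU patch; the body follows 2.4 conventions but is illustrative, not the exact patch code -

	struct dentry *d_lookup(struct dentry *parent, struct qstr *name)
	{
		unsigned int hash = name->hash;
		struct list_head *head = d_hash(parent, hash);
		struct list_head *tmp;

		rcu_read_lock();	/* replaces spin_lock(&dcache_lock) for the walk */
		for (tmp = head->next; tmp != head; tmp = tmp->next) {
			struct dentry *dentry = list_entry(tmp, struct dentry, d_hash);

			if (dentry->d_name.hash != hash)
				continue;
			if (dentry->d_parent != parent)
				continue;
			if (dentry->d_name.len != name->len)
				continue;
			if (memcmp(dentry->d_name.name, name->name, name->len))
				continue;

			/* Hit: the refcount/LRU update still needs the global
			 * lock in this first patch - a zero-count dentry sits
			 * on the LRU list and must come off it now. */
			spin_lock(&dcache_lock);
			if (!atomic_read(&dentry->d_count))
				list_del_init(&dentry->d_lru);
			atomic_inc(&dentry->d_count);
			spin_unlock(&dcache_lock);

			rcu_read_unlock();
			return dentry;
		}
		rcu_read_unlock();
		return NULL;
	}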

The RCU patch (rcu-2.4.10-1.patch) used for this approach can be found here

This proved that just doing a lock-free lookup of the dentry cache is not enough. d_lookup() also acquires dcache_lock to update the LRU list if the newly looked-up entry previously had a zero refcount. For the dbench workload this was true most of the time, so we ended up acquiring the lock after almost every lock-free hash-table lookup in d_lookup(). We needed some way to avoid acquiring dcache_lock so often. Different algorithms were tried to get rid of this lock in d_lookup(). Here is a summary of the results.

1. Separate lock for the LRU list

The idea behind this was that since d_lookup() only updates the LRU list, we could relax dcache_lock by introducing a separate lock for the LRU list. Though the logic worked, the entire load simply transferred to the LRU list lock, and there was practically no gain at all. Moreover, routines like prune_dcache(), select_parent() and d_prune_aliases() read/write other lists apart from the LRU list, so in those places both locks were needed. A sketch follows. dcache_rcu-lru_lock-2.4.16-02.patch. This patch was not tested completely except for running dbench and collecting lockmeter statistics (lockmeter statistics).
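A sketch of the idea, with d_lru_lock as a hypothetical name for the new lock and d_lru_ref()/d_prune_one() as illustrative helpers -

	/* hypothetical global lock protecting only the d_lru list */
	static spinlock_t d_lru_lock = SPIN_LOCK_UNLOCKED;

	/* fast path - d_lookup() hit: only the LRU lock is needed */
	static void d_lru_ref(struct dentry *dentry)
	{
		spin_lock(&d_lru_lock);
		if (!atomic_read(&dentry->d_count))
			list_del_init(&dentry->d_lru);	/* off the free list */
		atomic_inc(&dentry->d_count);
		spin_unlock(&d_lru_lock);
	}

	/* slow path - prune_dcache()/select_parent()/d_prune_aliases()
	 * touch the hash, alias and child lists as well, so they must
	 * nest both locks; meanwhile every d_lookup() hammers
	 * d_lru_lock exactly as hard as it used to hammer dcache_lock */
	static void d_prune_one(struct dentry *dentry)
	{
		spin_lock(&dcache_lock);
		spin_lock(&d_lru_lock);
		list_del_init(&dentry->d_lru);
		/* unhash, drop the inode reference, free the dentry */
		spin_unlock(&d_lru_lock);
		spin_unlock(&dcache_lock);
	}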

The RCU patch (rcu-2.4.16-1.patch) used for this approach can be found here

2. Per-bucket lock for the hash & LRU lists

We tried hard to make this one work, but it resulted in lots of nasty, hard-to-debug race conditions and never worked properly. The fundamental problem is that the dentry cache has several other lists that span buckets, and to modify or access any of those lists we would need to take multiple bucket locks. This results in a serious lock-ordering problem, and it turned out to be unworkable. This approach also suffered the same problem that both the per-bucket lock and dcache_lock were needed in many places. The entire logic was becoming quite complicated, and we could not get any statistics out of it, since it never worked beyond booting. The approach was abandoned. dcache_rcu-bucket-2.4.16-05.patch

The RCU patch (rcu-2.4.16-1.patch) used for this approach can be found here

3. Lazy dentry cache

Given that lock-free traversal of hash chains is not enough to reduce dcache_lock acquisitions, we looked at the possibility of removing dcache_lock acquisitions completely from d_lookup(). After switching to the RCU-based lock-free hash lookup, the only reason dcache_lock was still needed in d_lookup() was to update the LRU list when the entry previously had a zero refcount. We decided to relax the LRU list norms by letting entries with non-zero refcounts remain in it and deferring the update of the LRU list. The logic behind this was that instead of acquiring the global dcache_lock for every single entry, we would process a number of entries per acquisition of the lock. To accomplish this, we introduced another flag in the dentry to mark it, and a per-dentry lock to maintain consistency between the flag and the refcount. For all other fields, the refcount still provides the mutual exclusion. We also needed to ensure that we don't end up with so many entries with non-zero refcounts in the LRU list that the deletion process becomes very lengthy. For this we decided to try two approaches - periodic update of the LRU list using a timer, and periodic update of the LRU list during various traversals. Both approaches use the same tactics in conjunction with RCU; they differ only in how the periodic update is done.
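A minimal sketch of the resulting fast path, using the d_lock, d_vfs_flags and DCACHE_REFERENCED names described below; d_lazy_ref() is an illustrative helper, not the exact patch code -

	/* the hit path in d_lookup() under the lazy scheme: no dcache_lock
	 * at all; the per-dentry lock keeps the flag and the refcount
	 * consistent with each other */
	static inline void d_lazy_ref(struct dentry *dentry)
	{
		spin_lock(&dentry->d_lock);
		dentry->d_vfs_flags |= DCACHE_REFERENCED;  /* may still be on d_lru */
		atomic_inc(&dentry->d_count);
		spin_unlock(&dentry->d_lock);
	}

The point is that the only lock taken on a cache hit is now per-dentry, so lookups of different dentries no longer serialize on a global lock.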

The RCU patch (rcu-2.4.17-1.patch) used for this approach can be found here

Notes on lazy dcache implementation

Contention for dcache_lock is reduced in all routines, at the expense of prune_dcache(). prune_dcache() and select_parent() take more time since the d_lru list is longer, but that is acceptable as neither routine is in the fast path.

The per-dentry lock (d_lock) is needed to protect d_vfs_flags and d_count in d_lookup(). That is fine, as there is almost nil contention on the per-dentry lock. The DCACHE_REFERENCED flag now does some more work, i.e. it also indicates the dentries that are not supposed to be on the d_lru list. Right now, apart from d_lookup(), the per-dentry lock (dentry->d_lock) is used wherever we read or modify d_count or d_vfs_flags. It is probably possible to tune the code further and relax the locking in some cases.

Due to the lock hierarchy between dcache_lock and the per-dentry lock, the dput() code has become slightly ugly. It also uses a d_nexthash field in the dentry structure because of the deferred freeing of dentries using RCU.

Because of RCU, before unlinking a dentry from the d_hash list we have to update the d_nexthash pointer and mark the dentry for deferred freeing by setting the DCACHE_DEFERRED_FREE bit in d_vfs_flags. All of this is done by a new function, include/linux/dcache.h: d_unhash(). Because of this, the old fs/namei.c: d_unhash() was renamed to fs/namei.c: d_vfs_unhash().

As we do a lockless lookup, rmb() is used in d_lookup() to avoid out-of-order reads of d_nexthash, and wmb() is used in d_unhash() to make sure that d_vfs_flags and d_nexthash are updated before the dentry is unlinked from the d_hash chain.
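A sketch of the two sides of that ordering, assuming d_nexthash is a struct list_head pointer; d_next_hash() is an illustrative helper for the reader side -

	/* writer side - the new include/linux/dcache.h d_unhash():
	 * publish the successor pointer and the deferred-free flag
	 * before unlinking the dentry from the hash chain */
	static inline void d_unhash(struct dentry *dentry)
	{
		dentry->d_nexthash = dentry->d_hash.next; /* for racing readers */
		dentry->d_vfs_flags |= DCACHE_DEFERRED_FREE;
		wmb();			/* order both stores before the unlink */
		list_del(&dentry->d_hash);
	}

	/* reader side - choosing the next chain element during the
	 * lock-free walk in d_lookup(): if we raced with d_unhash(),
	 * continue the walk through the saved pointer */
	static inline struct list_head *d_next_hash(struct dentry *dentry,
						    struct list_head *tmp)
	{
		if (dentry->d_vfs_flags & DCACHE_DEFERRED_FREE) {
			rmb();		/* read the flag before d_nexthash */
			return dentry->d_nexthash;
		}
		return tmp->next;
	}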

The d_lru list is brought up to date through select_parent(), prune_dcache() and dput(); that is, dentries with non-zero refcounts are removed from the d_lru list in those routines.
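A sketch of that lazy cleanup, using prune_dcache() as the example site; d_lru_shrink() is an illustrative name, dentry_unused is the 2.4 LRU list, and the actual freeing step is elided -

	static void d_lru_shrink(int count)
	{
		spin_lock(&dcache_lock);
		while (count-- && !list_empty(&dentry_unused)) {
			struct dentry *dentry = list_entry(dentry_unused.prev,
							   struct dentry, d_lru);
			list_del_init(&dentry->d_lru);
			if (atomic_read(&dentry->d_count))
				continue;	/* parked lazily; still in use, just drop it */
			d_unhash(dentry);	/* RCU-deferred free, as sketched above */
			/* the inode reference is dropped and the dentry freed
			 * only after the RCU grace period */
		}
		spin_unlock(&dcache_lock);
	}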

Every dget() now marks the dentry as referenced by setting the DCACHE_REFERENCED bit in d_vfs_flags. So we always have to take the per-dentry lock in dget(), and there is no dget_locked() anymore.

Lazy dcache patches

dcache_rcu-2.5.3-2.patch
dcache_rcu-2.4.17-2.patch
dcache_rcu-lazy_lru-2.4.16-06.patch
dcache_rcu-lazy_lru-2.4.17-06.patch

For lazy LRU updating we also tried a timer-based approach, sketched below, in which a timer was used to remove the referenced dentries from the LRU list so that the list could be kept short. This did not prove any better than the no-timer approach. Also, since we have to take dcache_lock from the timer handler, we had to use spin_lock_bh() and spin_unlock_bh() for dcache_lock. This also created a problem of cyclic dependencies in dcache.h. The patch is still worth a look, as proper tuning of the timer frequency could possibly give better results. dcache_rcu-lazy_lru-timer-2.4.16-04.patch
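A sketch of the timer variant; the timer name, handler and interval are illustrative -

	static struct timer_list d_lru_timer;
	#define D_LRU_INTERVAL (5 * HZ)		/* hypothetical tuning knob */

	static void d_lru_timer_fn(unsigned long data)
	{
		struct list_head *tmp, *next;

		/* Timer handlers run in bottom-half context; this is why
		 * every process-context acquisition of dcache_lock had to
		 * become spin_lock_bh()/spin_unlock_bh(). */
		spin_lock(&dcache_lock);
		for (tmp = dentry_unused.next; tmp != &dentry_unused; tmp = next) {
			struct dentry *dentry = list_entry(tmp, struct dentry, d_lru);
			next = tmp->next;
			if (atomic_read(&dentry->d_count))
				list_del_init(&dentry->d_lru);	/* referenced: drop from LRU */
		}
		spin_unlock(&dcache_lock);
		mod_timer(&d_lru_timer, jiffies + D_LRU_INTERVAL);
	}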

As of now, the patch has been tested with ext2, ext3, JFS and the /proc filesystem. Our goal was only to experiment with the dcache; testing with other filesystems is in the pipeline.

Lazy dcache results

We ran dbench and httperf to understand the effect of lazy dcache, and the results were very good. By doing a lock-free d_lookup(), we were able to substantially cut down the number of dcache_lock acquisitions, which decreased both contention and lock utilization substantially.

dbench results:

The dbench results showed that lock utilization and contention levels remain flat with lazy dcache, as opposed to increasing steadily with the baseline kernel. For 8 processors, the contention level is 0.95%, as opposed to 16.5% for the baseline (2.4.16) kernel.

SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME
 0.89% 0.95%  0.6us(6516us)   19us(6411us)(0.03%)   2330127 99.0% 0.95%    0%  dcache_lock
 0.02%  1.7%  0.2us(  20us)   17us(2019us)(0.00%)    116150 98.3%  1.7%    0%    d_alloc+0x144
 0.03% 0.42%  0.2us(  49us)   35us(6033us)(0.00%)    233290 99.6% 0.42%    0%    d_delete+0x10
 0.00% 0.14%  0.8us(  12us)  3.4us( 8.5us)(0.00%)      5050 99.9% 0.14%    0%    d_delete+0x98
 0.03% 0.40%  0.1us(  32us)   34us(5251us)(0.00%)    349441 99.6% 0.40%    0%    d_instantiate+0x1c
 0.05% 0.30%  1.7us(  44us)   22us(1770us)(0.00%)     46800 99.7% 0.30%    0%    d_move+0x38
 0.01% 0.16%  0.1us(  21us)  4.5us( 334us)(0.00%)    116140 99.8% 0.16%    0%    d_rehash+0x40
 0.00% 0.65%  0.7us( 3.7us)  8.4us(  57us)(0.00%)      1680 99.3% 0.65%    0%    d_vfs_unhash+0x44
 0.56%  1.1%  0.7us(  84us)   18us(6411us)(0.02%)   1383859 98.9%  1.1%    0%    dput+0x30
 0.00% 0.88%  0.4us( 2.3us)  1.3us( 1.3us)(0.00%)       114 99.1% 0.88%    0%    link_path_walk+0x2d8
 0.00%    0%  0.9us( 1.6us)    0us                        2  100%    0%    0%    link_path_walk+0x678
 0.01%  4.4%  4.3us(6516us)  4.8us(  32us)(0.00%)      3566 95.6%  4.4%    0%    prune_dcache+0x14
 0.07%  2.3%  1.8us(6289us)  4.4us( 718us)(0.00%)     67591 97.7%  2.3%    0%    prune_dcache+0x150
 0.11% 0.79%   29us(4992us)   28us(1116us)(0.00%)      6444 99.2% 0.79%    0%    select_parent+0x24

One significant observation is that the maximum lock hold times for prune_dcache() and select_parent() are high with this algorithm. However, these are not frequently performed operations for this workload.

A comparison of baseline (2.4.16) kernel and lazy dcache can be seen in the following graphs -

The throughput results show marginal differences (statistically insignificant) for up to 4 CPUs, but an improvement of about 1% (statistically significant) on 8 CPUs. So there is no performance regression at the lower end, and the gains are very small at the higher end. Along with decreased contention for dcache_lock at the higher end, we see increased contention for the BKL. So it is likely that other bottlenecks negate the gain made by rcu_lazy_dcache.

The detailed lockmeter outputs are - 1 CPU, 2 CPUs, 4 CPUs, 6 CPUs and 8 CPUs.

httperf results:

The httperf results showed a similar decrease in lock contention and lock utilization. With 8 CPUs, there was significantly less contention -

SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME
  1.4% 0.92%  0.7us( 577us)  2.2us( 617us)(0.00%)   4821866 99.1% 0.92%    0%  dcache_lock
 0.02%  2.2%  0.6us(  30us)  1.9us( 7.8us)(0.00%)    100031 97.8%  2.2%    0%    d_alloc+0x144
 0.01%  1.7%  0.2us(  12us)  2.2us( 9.2us)(0.00%)    100032 98.3%  1.7%    0%    d_instantiate+0x1c
 0.03%  1.5%  0.7us( 9.2us)  2.3us(  10us)(0.00%)    100031 98.5%  1.5%    0%    d_rehash+0x40
 0.24%  2.1%  1.2us( 577us)  1.9us( 283us)(0.00%)    521329 97.9%  2.1%    0%    dput+0x30
  1.1% 0.70%  0.7us( 366us)  2.4us( 617us)(0.00%)   4000443 99.3% 0.70%    0%    link_path_walk+0x2d8

A comparison of lockmeter statistics for httperf can be seen in the following graphs -

The detailed lockmeter outputs are - 1 CPU, 2 CPUs, 4 CPUs, 6 CPUs and 8 CPUs.

The httperf results (replies/sec for a fixed connection rate) showed a statistically insignificant difference between the base 2.4.16 and lazy dcache kernels.

Note:

All dcache patches need the RCU patch for the corresponding kernel version. For example, for dcache_rcu-2.5.3-x.patch, first apply rcu-2.5.3-x.patch or rcu_poll-2.5.3-x.patch. The RCU and dcache patches can be downloaded from the Read-Copy Update package in LSE.