The directory entry cache is used to facilitate quick lookup of inodes from file paths. The basic lookup is done through a hash table, but the entries are also linked together in several other lists that provide additional information such as hierarchy, aliases and aging.
Currently the dentry cache is protected by a single global lock, dcache_lock. This lock is held during all operations on the dentry hash table. It is also used to protect all the other lists (d_child, d_alias, d_lru etc.) that connect the directory entries. This does not seem to be an issue on lower-end SMP machines, but as the number of CPUs increases, filesystem-intensive benchmarks show signs that dcache_lock is contended. We measured the scalability of the directory entry cache using the dbench, httperf and SPECweb99 benchmarks and experimented with various ways to enhance its performance.
So far we have used three benchmarks to understand the scaling of the dentry cache - SPECweb99, dbench (with settings that avoid I/O as much as possible) and httperf.
SPEC(tm) and the benchmark name SPECweb(tm) are registered trademarks of the Standard Performance Evaluation Corporation. The benchmarking was done for research purposes only and was non-compliant, with the following deviations from the rules -
Kernel : 2.4.17 + lse02E
Throughput : 2258 simultaneous connections
    SPINLOCKS         HOLD            WAIT
      UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

     15.7% 20.8%  2.1us(6668us)    23us(  14ms)( 4.4%)   5215460 79.2% 20.8%    0%  dcache_lock
     0.19% 19.4%  1.7us(1327us)    66us(  13ms)(0.17%)     75355 80.6% 19.4%    0%    d_alloc+0x130
     0.12% 20.8%  1.1us(1225us)    16us(2907us)(0.05%)     75353 79.2% 20.8%    0%    d_instantiate+0x1c
     13.5% 20.9%  2.1us(6668us)    20us(  13ms)( 3.3%)   4433721 79.1% 20.9%    0%    d_lookup+0x58
     0.16% 14.1%  1.5us(1065us)    17us(1930us)(0.03%)     75353 85.9% 14.1%    0%    d_rehash+0x3c
      1.7% 21.0%  2.1us(1119us)    40us(  14ms)(0.84%)    555661 79.0% 21.0%    0%    dput+0x30
     0.00%  5.9%  2.2us( 5.8us)   5.5us( 5.5us)(0.00%)        17 94.1%  5.9%    0%    link_path_walk+0x28c

Kernel profiling output shows that directory entry cache operations are taking a significant amount of time -
      176  dput            0.5432
       28  d_invalidate    0.2000
       46  d_alloc         0.1223
       12  d_instantiate   0.1765
     1293  d_lookup        4.4896
       19  d_rehash        0.2159

This clearly indicates that the global locking scheme for the directory entry cache is a problem and needs to be solved immediately.
The dbench results show that lock utilization and contention, while not significant, grow steadily with an increasing number of CPUs. The 8-CPU SMP results show 5.3% utilization with 16.5% contention on this lock.
The latest numbers are based on the 2.4.16 kernel. The numbers were measured on an 8-CPU P-III Xeon (1MB L2 cache) box with 2 GB RAM.
One significant observation from the lockmeter output is that for this workload d_lookup() is the most frequently performed operation.
    SPINLOCKS         HOLD            WAIT
      UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

      5.3% 16.5%  0.6us(2787us)   5.0us(3094us)(0.89%)  15069563 83.5% 16.5%    0%  dcache_lock
     0.01% 10.9%  0.2us( 7.5us)   5.3us( 116us)(0.00%)    119448 89.1% 10.9%    0%    d_alloc+0x128
     0.04% 14.2%  0.3us(  42us)   6.3us( 925us)(0.02%)    233290 85.8% 14.2%    0%    d_delete+0x10
     0.00%  3.5%  0.2us( 3.1us)   5.6us(  41us)(0.00%)      5050 96.5%  3.5%    0%    d_delete+0x94
     0.04% 10.9%  0.2us( 8.2us)   5.3us(1269us)(0.01%)    352739 89.1% 10.9%    0%    d_instantiate+0x1c
      4.8% 17.2%  0.7us(1362us)   4.8us(2692us)(0.76%)  12725262 82.8% 17.2%    0%    d_lookup+0x5c
     0.02% 11.0%  0.9us(  22us)   5.4us(1310us)(0.00%)     46800 89.0% 11.0%    0%    d_move+0x38
     0.01%  5.1%  0.2us(  37us)   4.2us(  84us)(0.00%)    119438 94.9%  5.1%    0%    d_rehash+0x40
     0.00%  2.5%  0.2us( 3.1us)   5.6us(  45us)(0.00%)      1680 97.5%  2.5%    0%    d_unhash+0x34
     0.31% 15.0%  0.4us(  64us)   6.2us(3094us)(0.09%)   1384623 85.0% 15.0%    0%    dput+0x30
     0.00% 0.82%  0.4us( 4.2us)   6.4us( 6.4us)(0.00%)       122 99.2% 0.82%    0%    link_path_walk+0x2a8
     0.00%    0%  1.7us( 1.8us)    0us                         2  100%    0%    0%    link_path_walk+0x618
     0.00%  6.4%  1.9us( 832us)   5.0us(  49us)(0.00%)      3630 93.6%  6.4%    0%    prune_dcache+0x14
     0.04%  9.4%  1.0us(1382us)   4.7us( 148us)(0.00%)     70974 90.6%  9.4%    0%    prune_dcache+0x138
     0.04%  4.2%   11us(2787us)   3.8us(  24us)(0.00%)      6505 95.8%  4.2%    0%    select_parent+0x20
This snippet of lockmeter output for 8-CPU SMP shows that 84% of the time dcache_lock was acquired by d_lookup().
The latest numbers are based on the 2.4.16 kernel. The numbers were measured on an 8-CPU P-III Xeon (1MB L2 cache) box with 2 GB RAM.
A snippet of the lockmeter output showing the distribution of dcache_lock acquisitions follows -
    SPINLOCKS         HOLD            WAIT
      UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

      6.2%  4.3%  0.8us( 390us)   2.7us( 579us)(0.12%)  20243025 95.7%  4.3%    0%  dcache_lock
     0.02%  6.5%  0.5us(  45us)   2.7us( 281us)(0.00%)    100031 93.5%  6.5%    0%    d_alloc+0x128
     0.01%  4.9%  0.2us( 4.6us)   2.9us(  58us)(0.00%)    100032 95.1%  4.9%    0%    d_instantiate+0x1c
      5.0%  4.5%  0.8us( 387us)   2.8us( 579us)(0.09%)  15009129 95.5%  4.5%    0%    d_lookup+0x5c
     0.02%  5.8%  0.6us(  34us)   3.1us(  45us)(0.00%)    100031 94.2%  5.8%    0%    d_rehash+0x40
     0.19%  8.8%  0.5us( 296us)   2.8us( 315us)(0.01%)    933218 91.2%  8.8%    0%    dput+0x30
     0.89%  2.3%  0.6us( 390us)   2.5us( 309us)(0.01%)   4000584 97.7%  2.3%    0%    link_path_walk+0x2a8
This shows that 74% of the time the global lock is acquired from d_lookup().
A common locking scheme in hash tables is to use per-bucket locks. We tried this approach with the dentry cache, but it resulted in lots of nasty, hard-to-debug race conditions and it never worked properly. The fundamental problem is that the dentry cache has several other lists that span buckets, and to modify or access any of those lists we need to take multiple bucket locks. This results in a serious lock ordering problem, and it turned out to be unusable. Several variations were tried, including one where a per-bucket lock protected the hash chain and the global dcache_lock protected the rest. All of these schemes resulted in very complicated code and never worked satisfactorily, so this approach was abandoned. The sketch below illustrates the ordering problem.
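To make the lock-ordering problem concrete, here is a purely illustrative sketch (not code from those experimental patches) of what a cross-bucket operation such as a rename would have to do under per-bucket locking; names such as dentry_hash_lock[] and D_HASHSIZE are hypothetical:

```c
#include <linux/spinlock.h>
#include <linux/dcache.h>

#define D_HASHSIZE 1024                         /* hypothetical table size */
static spinlock_t dentry_hash_lock[D_HASHSIZE]; /* one lock per hash bucket */

/* Moving a dentry to a different bucket needs two bucket locks, taken in a
 * fixed global order (here, by address) to avoid ABBA deadlocks. */
static void move_between_buckets(struct dentry *dentry,
                                 unsigned int old_bucket,
                                 unsigned int new_bucket)
{
        spinlock_t *first  = &dentry_hash_lock[old_bucket];
        spinlock_t *second = &dentry_hash_lock[new_bucket];

        if (first > second) {                   /* order the locks by address */
                spinlock_t *tmp = first;
                first = second;
                second = tmp;
        }

        spin_lock(first);
        if (second != first)
                spin_lock(second);

        /* ... unhash from the old chain, rehash on the new chain ... */
        /*
         * The trouble: d_lru, d_child and d_alias span buckets, so every
         * operation on those lists needs still more locks with their own
         * ordering rules -- which is why this scheme became unmanageable.
         */

        if (second != first)
                spin_unlock(second);
        spin_unlock(first);
}
```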
As the baseline lockmeter data shows, a large part of the dcache operations involve looking up the hash table. Based on our earlier presentation of this data on the linux-kernel mailing list, Mark Hahn had suggested that we try a big-reader lock. There is still a significant number of updates to directory entries, so on the surface a big-reader lock did not seem suitable. Nevertheless, in order to complete our search for solutions, we implemented a big-reader lock based directory cache. The implementation is very simple: dcache_lock is just replaced by BRLOCK_DCACHE. In places like d_lookup() where we only read from the dentry cache, br_read_lock() is used; elsewhere br_write_lock() is used. The following situations were handled differently -
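Before getting to those specially handled cases, here is a minimal sketch of the basic substitution itself, assuming the 2.4 big-reader lock interface from <linux/brlock.h>; BRLOCK_DCACHE is the new index the patch adds, and the helper functions are hypothetical:

```c
#include <linux/brlock.h>       /* br_read_lock()/br_write_lock(), 2.4 kernels */
#include <linux/dcache.h>

/* hypothetical helpers (bodies not shown) */
static struct dentry *hash_chain_lookup(struct dentry *parent, struct qstr *name);
static void hash_chain_insert(struct dentry *dentry);

/*
 * Read side: hash-chain walks only read the dentry lists, so the cheap
 * per-CPU read lock is enough.
 */
static struct dentry *dcache_find(struct dentry *parent, struct qstr *name)
{
        struct dentry *found;

        br_read_lock(BRLOCK_DCACHE);            /* was spin_lock(&dcache_lock) */
        found = hash_chain_lookup(parent, name);
        br_read_unlock(BRLOCK_DCACHE);
        return found;
}

/*
 * Write side: anything that modifies the hash chain or the other dentry
 * lists must take the expensive lock that touches every CPU's reader lock.
 */
static void dcache_add(struct dentry *dentry)
{
        br_write_lock(BRLOCK_DCACHE);           /* was spin_lock(&dcache_lock) */
        hash_chain_insert(dentry);
        br_write_unlock(BRLOCK_DCACHE);
}
```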
dcache_lock is held in d_lookup() while traversing the d_hash list, and also to update the LRU list (for freeing) if the dentry found has a zero reference count. By using RCU we can avoid dcache_lock while reading the d_hash list. That is what the first RCU dcache patch, dcache_rcu-2.4.10-01, did. In it, the d_hash lookup is done lock-free, but dcache_lock still has to be acquired while updating the LRU list. This patch does decrease lock hold time and contention to some extent. Lockmeter statistics for the base kernel (2.4.16) while running dbench are here and for this minimal patch are here.
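A minimal sketch of that first step follows, written with the later in-tree RCU list primitives (rcu_read_lock(), list_for_each_rcu()) for illustration rather than the exact 2.4 patch code, and assuming the 2.4-style list_head d_hash chains; the LRU handling is simplified:

```c
#include <linux/dcache.h>
#include <linux/list.h>
#include <linux/rcupdate.h>
#include <linux/spinlock.h>
#include <linux/string.h>

/*
 * Walk one hash chain without dcache_lock.  RCU lets the chain be traversed
 * safely while other CPUs insert or remove entries, as long as freeing of
 * dentries is deferred past a grace period.
 */
struct dentry *d_lookup_rcu(struct dentry *parent, struct qstr *name,
                            struct list_head *head)
{
        struct list_head *p;
        struct dentry *dentry = NULL;

        rcu_read_lock();
        list_for_each_rcu(p, head) {
                struct dentry *d = list_entry(p, struct dentry, d_hash);

                if (d->d_parent == parent &&
                    d->d_name.hash == name->hash &&
                    d->d_name.len == name->len &&
                    !memcmp(d->d_name.name, name->name, name->len)) {
                        dentry = dget(d);       /* take a reference */
                        break;
                }
        }
        rcu_read_unlock();

        /*
         * The lookup itself was lock-free, but if the dentry previously had a
         * zero refcount it may still sit on the LRU list, and removing it from
         * there still needs the global lock -- which is exactly why this
         * minimal patch alone did not help much.
         */
        if (dentry && atomic_read(&dentry->d_count) == 1) {
                spin_lock(&dcache_lock);
                if (!list_empty(&dentry->d_lru))
                        list_del_init(&dentry->d_lru);
                spin_unlock(&dcache_lock);
        }
        return dentry;
}
```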
This proved that just doing a lock-free lookup of the dentry hash table is not enough. d_lookup() also acquires dcache_lock to update the LRU list if the newly looked-up entry previously had a zero reference count. For the dbench workload this was true most of the time, so we ended up acquiring the lock after almost every lock-free hash lookup in d_lookup(). We needed to avoid acquiring dcache_lock so often. Different algorithms were tried to get rid of this lock from d_lookup(). Here is a summary of the results.
One fundamental realization that came out of these experiments is that it is better to keep dcache_lock and optimize around it. This keeps things simple.
Given that lock-free traversal of the hash chains is not enough to reduce dcache_lock acquisitions, we looked at the possibility of removing dcache_lock acquisitions completely from d_lookup(). After using the RCU-based lock-free hash lookup, the only reason dcache_lock was still needed in d_lookup() was to update the LRU list if the entry previously had a zero refcount. We decided to relax the LRU list rules by letting entries with a non-zero refcount stay on it and delaying the LRU list update. The logic behind this was that instead of acquiring the global dcache_lock for every single entry, we would process a number of entries per acquisition of the lock. To accomplish this, we introduced another flag in the dentry to mark it and a per-dentry lock to maintain consistency between the flag and the refcount. For all other fields, the refcount still provided the mutual exclusion. We also needed to ensure that the LRU list does not accumulate so many entries with a non-zero refcount that the deletion process becomes very lengthy. For this we decided to try two approaches - periodic update of the LRU list using a timer, and periodic update of the LRU list during various traversals. Both approaches use the same tactics in conjunction with RCU; they differ only in how the periodic update is done. A sketch of the common bookkeeping follows.
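The following rough sketch illustrates the lazy-LRU bookkeeping described above; the flag name DCACHE_DEFERRED_LRU and the per-dentry lock d_ref_lock are hypothetical stand-ins for what the patch actually added:

```c
#include <linux/dcache.h>
#include <linux/list.h>
#include <linux/spinlock.h>

#define DCACHE_DEFERRED_LRU 0x1000  /* hypothetical flag: on d_lru, removal deferred */

/*
 * Lookup side: instead of taking the global dcache_lock to pull a
 * re-referenced dentry off the LRU list, only the per-dentry lock is taken,
 * to keep the flag and the refcount consistent.  The entry stays on the LRU
 * list with a non-zero refcount for now.
 */
static void d_reference(struct dentry *dentry)
{
        spin_lock(&dentry->d_ref_lock);         /* hypothetical per-dentry lock */
        atomic_inc(&dentry->d_count);
        /* DCACHE_DEFERRED_LRU stays set: LRU removal happens later, in batches */
        spin_unlock(&dentry->d_ref_lock);
}

/*
 * Periodic pass (from a timer or from prune_dcache()/select_parent()
 * traversals): one dcache_lock acquisition cleans up many entries, instead
 * of one acquisition per lookup.
 */
static void d_lru_cleanup(struct list_head *lru)
{
        struct list_head *p, *n;

        spin_lock(&dcache_lock);
        list_for_each_safe(p, n, lru) {
                struct dentry *dentry = list_entry(p, struct dentry, d_lru);

                spin_lock(&dentry->d_ref_lock);
                if (atomic_read(&dentry->d_count)) {
                        /* entry is in use again: drop it from the LRU list */
                        list_del_init(&dentry->d_lru);
                        dentry->d_flags &= ~DCACHE_DEFERRED_LRU;
                }
                spin_unlock(&dentry->d_ref_lock);
        }
        spin_unlock(&dcache_lock);
}
```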
For lazy LRU updates we also tried a timer-based approach. In this, a timer was used to remove the referenced dentries from the LRU list so that the LRU list could be kept short. This did not prove any better than the no-timer approach. Also, since we have to take dcache_lock from the timer handler, we had to use spin_lock_bh() and spin_unlock_bh() for dcache_lock. This also created the problem of cyclic dependencies in dcache.h. The patch is still worth looking at, as proper tuning of the timer frequency might give better results. dcache_rcu-lazy_lru-timer-2.4.16-04.patch
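As an illustration of why the _bh variants become necessary, here is a sketch of how such a timer could be set up with the 2.4 timer API; the interval and function names are made up for illustration:

```c
#include <linux/timer.h>
#include <linux/sched.h>        /* jiffies, HZ */
#include <linux/spinlock.h>
#include <linux/dcache.h>

#define LRU_SCAN_INTERVAL (5 * HZ)      /* hypothetical: scan every 5 seconds */

static struct timer_list lru_timer;

static void lru_timer_fn(unsigned long data)
{
        /*
         * Runs in timer (bottom-half) context.  Because this handler takes
         * dcache_lock, every other acquirer must use the _bh variants;
         * otherwise a CPU holding the lock could be interrupted by this very
         * handler and deadlock on it.
         */
        spin_lock_bh(&dcache_lock);
        /* ... walk the LRU list and drop re-referenced entries ... */
        spin_unlock_bh(&dcache_lock);

        mod_timer(&lru_timer, jiffies + LRU_SCAN_INTERVAL);
}

static void lru_timer_start(void)
{
        init_timer(&lru_timer);
        lru_timer.function = lru_timer_fn;
        lru_timer.expires = jiffies + LRU_SCAN_INTERVAL;
        add_timer(&lru_timer);
}
```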
The dcache_rcu-lazy_lru patch has been renamed to dcache_rcu since that is the final algorithm we chose. The dcache_rcu patches are available for download from the Read-Copy Update package in the LSE download section.
    SPINLOCKS         HOLD            WAIT
      UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

     0.89% 0.95%  0.6us(6516us)    19us(6411us)(0.03%)   2330127 99.0% 0.95%    0%  dcache_lock
     0.02%  1.7%  0.2us(  20us)    17us(2019us)(0.00%)    116150 98.3%  1.7%    0%    d_alloc+0x144
     0.03% 0.42%  0.2us(  49us)    35us(6033us)(0.00%)    233290 99.6% 0.42%    0%    d_delete+0x10
     0.00% 0.14%  0.8us(  12us)   3.4us( 8.5us)(0.00%)      5050 99.9% 0.14%    0%    d_delete+0x98
     0.03% 0.40%  0.1us(  32us)    34us(5251us)(0.00%)    349441 99.6% 0.40%    0%    d_instantiate+0x1c
     0.05% 0.30%  1.7us(  44us)    22us(1770us)(0.00%)     46800 99.7% 0.30%    0%    d_move+0x38
     0.01% 0.16%  0.1us(  21us)   4.5us( 334us)(0.00%)    116140 99.8% 0.16%    0%    d_rehash+0x40
     0.00% 0.65%  0.7us( 3.7us)   8.4us(  57us)(0.00%)      1680 99.3% 0.65%    0%    d_vfs_unhash+0x44
     0.56%  1.1%  0.7us(  84us)    18us(6411us)(0.02%)   1383859 98.9%  1.1%    0%    dput+0x30
     0.00% 0.88%  0.4us( 2.3us)   1.3us( 1.3us)(0.00%)       114 99.1% 0.88%    0%    link_path_walk+0x2d8
     0.00%    0%  0.9us( 1.6us)    0us                         2  100%    0%    0%    link_path_walk+0x678
     0.01%  4.4%  4.3us(6516us)   4.8us(  32us)(0.00%)      3566 95.6%  4.4%    0%    prune_dcache+0x14
     0.07%  2.3%  1.8us(6289us)   4.4us( 718us)(0.00%)     67591 97.7%  2.3%    0%    prune_dcache+0x150
     0.11% 0.79%   29us(4992us)    28us(1116us)(0.00%)      6444 99.2% 0.79%    0%    select_parent+0x24
One significant observation is that the maximum lock hold times for prune_dcache() and select_parent() are high with this algorithm. However, these are not frequently performed operations for this workload.
A comparison of the baseline (2.4.16) kernel and the lazy dcache kernel can be seen in the following graphs -
The throughput results show marginal (statistically insignificant) differences for up to 4 CPUs, but a difference of about 1% (statistically significant) on 8 CPUs. So there is no performance regression at the lower end, and the gains are quite small at the higher end.
The detailed lockmeter outputs are - 1 CPU, 2 CPUs, 4 CPUs, 6 CPUs and 8 CPUs.
    SPINLOCKS         HOLD            WAIT
      UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

      1.4% 0.92%  0.7us( 577us)   2.2us( 617us)(0.00%)   4821866 99.1% 0.92%    0%  dcache_lock
     0.02%  2.2%  0.6us(  30us)   1.9us( 7.8us)(0.00%)    100031 97.8%  2.2%    0%    d_alloc+0x144
     0.01%  1.7%  0.2us(  12us)   2.2us( 9.2us)(0.00%)    100032 98.3%  1.7%    0%    d_instantiate+0x1c
     0.03%  1.5%  0.7us( 9.2us)   2.3us(  10us)(0.00%)    100031 98.5%  1.5%    0%    d_rehash+0x40
     0.24%  2.1%  1.2us( 577us)   1.9us( 283us)(0.00%)    521329 97.9%  2.1%    0%    dput+0x30
      1.1% 0.70%  0.7us( 366us)   2.4us( 617us)(0.00%)   4000443 99.3% 0.70%    0%    link_path_walk+0x2d8
A comparison of lockmeter statistics for httperf can be seen in the following graphs -
The detailed lockmeter outputs are - 1 CPU, 2 CPUs, 4 CPUs, 6 CPUs and 8 CPUs.
The results of httperf (replies/sec for a fixed connection rate) showed a statistically insignificant difference between the base 2.4.16 and lazy dcache kernels.
Troy Wilson did a comparative study of some of the dcache patches using SPECweb99. His measurements identify what works well for the dentry cache. Here is a summary of his results. The measurements were done on an 8-CPU PIII Xeon server with more RAM than you can dream of getting in your desktop.
kernel | throughput (simultaneous connections) | % improvement |
---|---|---|
2.4.17+lse02E | 2258 | - |
2.4.17+lse02E+fastwalk | 2280 | 1% |
2.4.17+lse02E+dcache_rcu | 2530 | 12% |
2.4.17+lse02E:
    SPINLOCKS         HOLD            WAIT
      UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

     15.7% 20.8%  2.1us(6668us)    23us(  14ms)( 4.4%)   5215460 79.2% 20.8%    0%  dcache_lock
2.4.17+lse02E+fastwalk:
    SPINLOCKS         HOLD            WAIT
      UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

     17.2% 17.7%  7.0us(  13ms)    53us(  30ms)( 2.9%)   1608566 82.3% 17.7%    0%  dcache_lock
2.4.17+lse02E+dcache_rcu:
    SPINLOCKS         HOLD            WAIT
      UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

      1.9%  2.3%  2.0us(3343us)    71us(9406us)(0.20%)    657152 97.7%  2.3%    0%  dcache_lock
Fastwalk clearly helps by reducing the number of dcache_lock acquisitions by 69.9%. However, holding the lock over the entire walk of the path is clearly detrimental to performance, as seen by the roughly 3-fold increase in average hold time and average wait time. dcache_rcu has the biggest positive impact on performance for a webserver type of workload. Its gains come from two improvements: an 87.3% reduction in lock acquisitions while at the same time keeping the lock hold time constant.
The dcache_rcu patch for 2.5 essentially does the same thing as in 2.4, that is, it makes d_lookup() faster by not taking the global dcache_lock and significantly brings down the contention for dcache_lock. In the 2.5 version, dcache_lock is not held while walking the path; instead, dentries are kept around by incrementing their reference count. This also helps reduce the hold time of dcache_lock by more than 50%. The dcache_rcu patch for the 2.5 kernel can be downloaded from the Read-Copy Update package in the LSE download section.
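A simplified, illustrative sketch of one step of such a walk follows; the helper cached_lookup_rcu() is hypothetical (a lock-free lookup that does not itself take a reference), and the real patch is more involved. The component is looked up without dcache_lock and pinned by its reference count before moving on:

```c
#include <linux/dcache.h>
#include <linux/rcupdate.h>

/* hypothetical lock-free lookup helper (body not shown, takes no reference) */
static struct dentry *cached_lookup_rcu(struct dentry *parent, struct qstr *name);

/*
 * One step of the path walk: dcache_lock is not held across the step; the
 * dentry found is kept alive by its reference count instead.
 */
static struct dentry *walk_one_component(struct dentry *parent,
                                         struct qstr *name)
{
        struct dentry *next;

        rcu_read_lock();
        next = cached_lookup_rcu(parent, name);
        if (next)
                dget(next);             /* pin before leaving the RCU section */
        rcu_read_unlock();

        dput(parent);                   /* release the previous component */
        return next;
}
```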
Following is the throughput (MB/sec) comparison while running dbench with 12 clients between the base (2.5.30) kernel and dcache_rcu-12.
Following is the throughput (MB/sec) comparison while running dbench for a varying number of clients on an 8-CPU PIII Xeon with 16 GB RAM.
2.5.30:
    SPINLOCKS         HOLD            WAIT
      UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

      3.3%  7.8%  2.1us(2437us)   4.8us(2415us)(0.02%)    309088 92.2%  7.8%    0%  dcache_lock
2.5.30+dcache_rcu-12-2.5.30:
    SPINLOCKS         HOLD            WAIT
      UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

     0.60%  1.1%  1.0us(1108us)   6.0us(1260us)(0.00%)    112929 98.9%  1.1%    0%  dcache_lock
The detailed lockmeter reports are available at -
Base (2.5.30)
Base (2.5.30) + dcache_rcu-12-2.5.30.patch