One significant observation from the lockmeter output is that, for this workload, d_lookup() is the most frequently performed operation.
SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)      TOTAL NOWAIT SPIN RJECT  NAME
  5.3%  16.5%  0.6us(2787us)  5.0us(3094us)(0.89%)  15069563 83.5% 16.5%    0%  dcache_lock
  0.01% 10.9%  0.2us( 7.5us)  5.3us( 116us)(0.00%)    119448 89.1% 10.9%    0%    d_alloc+0x128
  0.04% 14.2%  0.3us(  42us)  6.3us( 925us)(0.02%)    233290 85.8% 14.2%    0%    d_delete+0x10
  0.00%  3.5%  0.2us( 3.1us)  5.6us(  41us)(0.00%)      5050 96.5%  3.5%    0%    d_delete+0x94
  0.04% 10.9%  0.2us( 8.2us)  5.3us(1269us)(0.01%)    352739 89.1% 10.9%    0%    d_instantiate+0x1c
  4.8%  17.2%  0.7us(1362us)  4.8us(2692us)(0.76%)  12725262 82.8% 17.2%    0%    d_lookup+0x5c
  0.02% 11.0%  0.9us(  22us)  5.4us(1310us)(0.00%)     46800 89.0% 11.0%    0%    d_move+0x38
  0.01%  5.1%  0.2us(  37us)  4.2us(  84us)(0.00%)    119438 94.9%  5.1%    0%    d_rehash+0x40
  0.00%  2.5%  0.2us( 3.1us)  5.6us(  45us)(0.00%)      1680 97.5%  2.5%    0%    d_unhash+0x34
  0.31% 15.0%  0.4us(  64us)  6.2us(3094us)(0.09%)   1384623 85.0% 15.0%    0%    dput+0x30
  0.00% 0.82%  0.4us( 4.2us)  6.4us( 6.4us)(0.00%)       122 99.2% 0.82%    0%    link_path_walk+0x2a8
  0.00%    0%  1.7us( 1.8us)    0us                        2  100%     0%    0%    link_path_walk+0x618
  0.00%  6.4%  1.9us( 832us)  5.0us(  49us)(0.00%)      3630 93.6%  6.4%    0%    prune_dcache+0x14
  0.04%  9.4%  1.0us(1382us)  4.7us( 148us)(0.00%)     70974 90.6%  9.4%    0%    prune_dcache+0x138
  0.04%  4.2%   11us(2787us)  3.8us(  24us)(0.00%)      6505 95.8%  4.2%    0%    select_parent+0x20
This snippet of lockmeter output for the 8-way run shows that 84% of the dcache_lock acquisitions came from d_lookup().
A snippet of lockmeter output showing the distribution of dcache_lock acquisitions follows -
SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)      TOTAL NOWAIT SPIN RJECT  NAME
  6.2%   4.3%  0.8us( 390us)  2.7us( 579us)(0.12%)  20243025 95.7%  4.3%    0%  dcache_lock
  0.02%  6.5%  0.5us(  45us)  2.7us( 281us)(0.00%)    100031 93.5%  6.5%    0%    d_alloc+0x128
  0.01%  4.9%  0.2us( 4.6us)  2.9us(  58us)(0.00%)    100032 95.1%  4.9%    0%    d_instantiate+0x1c
  5.0%   4.5%  0.8us( 387us)  2.8us( 579us)(0.09%)  15009129 95.5%  4.5%    0%    d_lookup+0x5c
  0.02%  5.8%  0.6us(  34us)  3.1us(  45us)(0.00%)    100031 94.2%  5.8%    0%    d_rehash+0x40
  0.19%  8.8%  0.5us( 296us)  2.8us( 315us)(0.01%)    933218 91.2%  8.8%    0%    dput+0x30
  0.89%  2.3%  0.6us( 390us)  2.5us( 309us)(0.01%)   4000584 97.7%  2.3%    0%    link_path_walk+0x2a8
This shows that 74% of the acquisitions of the global lock came from d_lookup().
dcache_lock is held in d_lookup() while traversing the d_hash list, and while updating the LRU list (for freeing) if the dentry found had a zero refcount. By using RCU we can avoid taking dcache_lock while reading the d_hash list. That is what the first RCU dcache patch, dcache_rcu-2.4.10-01, did: the d_hash lookup was done lock-free, but dcache_lock was still taken while updating the LRU list. This patch does provide some reduction in lock hold time and contention level. Lockmeter statistics for the base kernel (2.4.16) while running dbench are here and for this minimal patch are here.
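As an illustration of that first approach, here is a minimal sketch (not the actual patch code) of code that would live in fs/dcache.c. It assumes the rcu_read_lock()/rcu_read_unlock() primitives supplied by the RCU patch and a d_hash() bucket helper, and it elides the details of the real d_lookup():

#include <linux/dcache.h>
#include <linux/spinlock.h>

/*
 * Sketch: walk the d_hash chain without dcache_lock, but still take
 * dcache_lock to pull the dentry off the LRU list when its refcount
 * was zero.
 */
struct dentry *d_lookup_sketch(struct dentry *parent, struct qstr *name)
{
    struct list_head *head = d_hash(parent, name->hash); /* hash bucket */
    struct list_head *tmp;

    rcu_read_lock();                        /* lock-free hash walk */
    for (tmp = head->next; tmp != head; tmp = tmp->next) {
        struct dentry *dentry = list_entry(tmp, struct dentry, d_hash);

        if (dentry->d_name.hash != name->hash)
            continue;
        if (dentry->d_parent != parent)
            continue;
        if (dentry->d_name.len != name->len)
            continue;
        if (memcmp(dentry->d_name.name, name->name, name->len))
            continue;

        /* Found it: the LRU list update still needs the global lock */
        spin_lock(&dcache_lock);
        if (!atomic_read(&dentry->d_count))
            list_del_init(&dentry->d_lru);  /* was unused, leave the LRU */
        atomic_inc(&dentry->d_count);
        spin_unlock(&dcache_lock);
        rcu_read_unlock();
        return dentry;
    }
    rcu_read_unlock();
    return NULL;
}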
The RCU patch (rcu-2.4.10-1.patch) used for this approach can be found here.
This proved that just doing a lock-free lookup of the dentry cache is not enough. d_lookup() also acquires dcache_lock to update the LRU list if the newly looked-up entry previously had a zero refcount. For the dbench workload this was true most of the time, so we ended up acquiring the lock after almost every lock-free hash table lookup in d_lookup(). We needed to avoid acquiring dcache_lock so often. Different algorithms were tried to get rid of this lock in d_lookup(). Here is a summary of the results.
The RCU patch (rcu-2.4.16-1.patch) used for this approach can be found here.
The RCU patch (rcu-2.4.16-1.patch) used for this approach can be found here.
The RCU patch (rcu-2.4.17-1.patch) used for this approach can be found here.
Contention for dcache_lock is reduced in all routines, at the expense of prune_dcache(). prune_dcache() and select_parent() take more time as the d_lru list is longer, but that is acceptable because neither routine is in the fast path.
The per-dentry lock (d_lock) is needed to protect d_vfs_flags and d_count in d_lookup(). That is fine, as there is almost no contention on the per-dentry lock. The DCACHE_REFERENCED flag now does some extra work: it also marks dentries that are not supposed to be on the d_lru list. Currently, apart from d_lookup(), the per-dentry lock (dentry->d_lock) is taken wherever d_count or d_vfs_flags is read or modified. It is probably possible to tune the code further and relax the locking in some cases.
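For illustration, here is a minimal sketch of the d_lookup() hit path under this scheme, assuming the patch's per-dentry spinlock d_lock and the overloaded meaning of DCACHE_REFERENCED described above:

/*
 * Sketch of the d_lookup() hit path with lazy LRU updating: instead of
 * taking dcache_lock to remove the dentry from d_lru, only mark it as
 * referenced under the per-dentry lock; the LRU list is cleaned up later.
 */
static inline void d_lookup_hit(struct dentry *dentry)
{
    spin_lock(&dentry->d_lock);        /* protects d_count and d_vfs_flags */
    atomic_inc(&dentry->d_count);
    /* DCACHE_REFERENCED now also means "skip me on the d_lru list" */
    dentry->d_vfs_flags |= DCACHE_REFERENCED;
    spin_unlock(&dentry->d_lock);
}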
Due to the lock hierarchy between dcache_lock and the per-dentry lock, the dput() code has become slightly ugly. It also uses a d_nexthash field in the dentry structure, needed because dentries are now freed in a deferred fashion using RCU.
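A rough sketch of the shape this gives dput(), assuming dcache_lock nests outside dentry->d_lock (the actual handling in the patch differs in detail):

/*
 * Why dput() gets ugly: the fast path only needs the per-dentry lock,
 * but once the count may drop to zero we need dcache_lock, which must
 * be taken first, so the locks have to be dropped, reacquired in the
 * right order and the count rechecked.
 */
void dput_sketch(struct dentry *dentry)
{
    spin_lock(&dentry->d_lock);
    if (atomic_read(&dentry->d_count) > 1) {
        atomic_dec(&dentry->d_count);      /* fast path: no global lock */
        spin_unlock(&dentry->d_lock);
        return;
    }
    spin_unlock(&dentry->d_lock);

    /* Slow path: respect the dcache_lock -> d_lock ordering and recheck */
    spin_lock(&dcache_lock);
    spin_lock(&dentry->d_lock);
    if (atomic_dec_and_test(&dentry->d_count)) {
        /* last reference: put on d_lru or free it, details elided */
    }
    spin_unlock(&dentry->d_lock);
    spin_unlock(&dcache_lock);
}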
Because of RCU, before unlinking a dentry from the d_hash list we have to update its d_nexthash pointer and mark the dentry for deferred freeing by setting the DCACHE_DEFERRED_FREE bit in d_vfs_flags. All of this is done by a new function, include/linux/dcache.h: d_unhash(). Because of this, the existing fs/namei.c: d_unhash() was renamed to fs/namei.c: d_vfs_unhash().
As the lookup is lockless, rmb() is used in d_lookup() to avoid an out-of-order read of d_nexthash, and wmb() is used in d_unhash() to make sure that d_vfs_flags and d_nexthash are updated before the dentry is unlinked from the d_hash chain.
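A sketch of that ordering; d_nexthash and DCACHE_DEFERRED_FREE are the patch's names, but the bodies below are only illustrative:

/*
 * Writer side: record the successor and mark the dentry before unlinking
 * it, with a wmb() so that lock-free readers never see the dentry off the
 * chain without also seeing d_nexthash and the deferred-free flag.
 */
static inline void d_unhash_sketch(struct dentry *dentry)
{
    dentry->d_nexthash = dentry->d_hash.next;   /* where the chain continues */
    dentry->d_vfs_flags |= DCACHE_DEFERRED_FREE;
    wmb();                         /* publish the two stores above ...       */
    list_del(&dentry->d_hash);     /* ... before the unlink becomes visible  */
}

/*
 * Reader side (inside the lock-free d_lookup() walk): if a dentry is being
 * deferred-freed, an rmb() keeps the read of d_nexthash from being
 * reordered before the flag check.
 */
static inline struct list_head *d_next_sketch(struct dentry *dentry)
{
    if (dentry->d_vfs_flags & DCACHE_DEFERRED_FREE) {
        rmb();
        return dentry->d_nexthash;
    }
    return dentry->d_hash.next;
}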
The d_lru list is brought up to date through select_parent(), prune_dcache() and dput(); that is, dentries with a non-zero refcount are removed from the d_lru list in those routines.
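Roughly, that cleanup looks like the following sketch (illustrative only; dentry_unused is the global unused-dentry list in 2.4, and the actual freeing path is elided):

/*
 * Sketch of lazy LRU cleanup: entries that are in use or have been
 * re-referenced since they were put on d_lru are simply dropped from
 * the list instead of being pruned.
 */
static void prune_dcache_sketch(int count)
{
    spin_lock(&dcache_lock);
    while (count && !list_empty(&dentry_unused)) {
        struct dentry *dentry =
            list_entry(dentry_unused.prev, struct dentry, d_lru);

        list_del_init(&dentry->d_lru);

        spin_lock(&dentry->d_lock);
        if (atomic_read(&dentry->d_count) ||
            (dentry->d_vfs_flags & DCACHE_REFERENCED)) {
            /* in use or recently referenced: just take it off the LRU */
            dentry->d_vfs_flags &= ~DCACHE_REFERENCED;
            spin_unlock(&dentry->d_lock);
            continue;
        }
        spin_unlock(&dentry->d_lock);

        /* genuinely unused: free it (prune_one_dentry() etc. elided) */
        count--;
    }
    spin_unlock(&dcache_lock);
}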
Now every dget() marks the dentry as referenced by setting the DCACHE_REFERENCED bit in d_vfs_flags. So the per-dentry lock must always be taken in dget(), and there is no separate dget_locked().
For lazy LRU updating we also tried a timer-based approach, in which a timer removes referenced dentries from the LRU list so that the list stays short. This did not prove any better than the no-timer approach. Also, since dcache_lock has to be taken from the timer handler, we had to use spin_lock_bh() and spin_unlock_bh() for dcache_lock, which also created a problem of cyclic dependencies in dcache.h. The patch is still worth looking at, as proper tuning of the timer frequency may give better results: dcache_rcu-lazy_lru-timer-2.4.16-04.patch
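A sketch of what the timer-based variant looks like; the interval and function names here are made up, and the handler body would do the same lazy pruning shown earlier:

#include <linux/timer.h>
#include <linux/spinlock.h>

#define LAZY_LRU_INTERVAL HZ           /* assumed: run once per second */

static struct timer_list lazy_lru_timer;

static void lazy_lru_timer_fn(unsigned long data)
{
    /*
     * The handler runs in bottom-half context, so process-context users
     * of dcache_lock must switch to spin_lock_bh()/spin_unlock_bh() to
     * avoid deadlocking against this timer.
     */
    spin_lock_bh(&dcache_lock);
    /* drop re-referenced dentries from d_lru, as in the prune sketch above */
    spin_unlock_bh(&dcache_lock);

    mod_timer(&lazy_lru_timer, jiffies + LAZY_LRU_INTERVAL);
}

static void lazy_lru_timer_start(void)
{
    init_timer(&lazy_lru_timer);
    lazy_lru_timer.function = lazy_lru_timer_fn;
    lazy_lru_timer.expires = jiffies + LAZY_LRU_INTERVAL;
    add_timer(&lazy_lru_timer);
}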
So far the patch has been tested with ext2, ext3, JFS and the /proc filesystem. Our goal was only to experiment with the dcache; testing with other filesystems is in the pipeline.
SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)      TOTAL NOWAIT SPIN RJECT  NAME
  0.89% 0.95%  0.6us(6516us)   19us(6411us)(0.03%)   2330127 99.0% 0.95%    0%  dcache_lock
  0.02%  1.7%  0.2us(  20us)   17us(2019us)(0.00%)    116150 98.3%  1.7%    0%    d_alloc+0x144
  0.03% 0.42%  0.2us(  49us)   35us(6033us)(0.00%)    233290 99.6% 0.42%    0%    d_delete+0x10
  0.00% 0.14%  0.8us(  12us)  3.4us( 8.5us)(0.00%)      5050 99.9% 0.14%    0%    d_delete+0x98
  0.03% 0.40%  0.1us(  32us)   34us(5251us)(0.00%)    349441 99.6% 0.40%    0%    d_instantiate+0x1c
  0.05% 0.30%  1.7us(  44us)   22us(1770us)(0.00%)     46800 99.7% 0.30%    0%    d_move+0x38
  0.01% 0.16%  0.1us(  21us)  4.5us( 334us)(0.00%)    116140 99.8% 0.16%    0%    d_rehash+0x40
  0.00% 0.65%  0.7us( 3.7us)  8.4us(  57us)(0.00%)      1680 99.3% 0.65%    0%    d_vfs_unhash+0x44
  0.56%  1.1%  0.7us(  84us)   18us(6411us)(0.02%)   1383859 98.9%  1.1%    0%    dput+0x30
  0.00% 0.88%  0.4us( 2.3us)  1.3us( 1.3us)(0.00%)       114 99.1% 0.88%    0%    link_path_walk+0x2d8
  0.00%    0%  0.9us( 1.6us)    0us                        2  100%     0%    0%    link_path_walk+0x678
  0.01%  4.4%  4.3us(6516us)  4.8us(  32us)(0.00%)      3566 95.6%  4.4%    0%    prune_dcache+0x14
  0.07%  2.3%  1.8us(6289us)  4.4us( 718us)(0.00%)     67591 97.7%  2.3%    0%    prune_dcache+0x150
  0.11% 0.79%   29us(4992us)   28us(1116us)(0.00%)      6444 99.2% 0.79%    0%    select_parent+0x24
One significant observation is that the maximum lock hold times for prune_dcache() and select_parent() are high with this algorithm. However, these are not frequently performed operations for this workload.
A comparison of baseline (2.4.16) kernel and lazy dcache can be seen in the following graphs -
The throughput results show marginal differences (statistically insignificant) for up to 4 CPUs, but a gain of about 1% (statistically significant) on 8 CPUs. So there is no performance regression at the low end, and the gains are very small at the high end. Along with the decreased contention for dcache_lock at the high end we see increased contention for the BKL, so it is likely that other bottlenecks negate the gain made by rcu_lazy_dcache.
The detailed lockmeter outputs are - 1 CPU, 2 CPUs, 4 CPUs, 6 CPUs and 8 CPUs.
SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)      TOTAL NOWAIT SPIN RJECT  NAME
  1.4%  0.92%  0.7us( 577us)  2.2us( 617us)(0.00%)   4821866 99.1% 0.92%    0%  dcache_lock
  0.02%  2.2%  0.6us(  30us)  1.9us( 7.8us)(0.00%)    100031 97.8%  2.2%    0%    d_alloc+0x144
  0.01%  1.7%  0.2us(  12us)  2.2us( 9.2us)(0.00%)    100032 98.3%  1.7%    0%    d_instantiate+0x1c
  0.03%  1.5%  0.7us( 9.2us)  2.3us(  10us)(0.00%)    100031 98.5%  1.5%    0%    d_rehash+0x40
  0.24%  2.1%  1.2us( 577us)  1.9us( 283us)(0.00%)    521329 97.9%  2.1%    0%    dput+0x30
  1.1%  0.70%  0.7us( 366us)  2.4us( 617us)(0.00%)   4000443 99.3% 0.70%    0%    link_path_walk+0x2d8
A comparison of lockmeter statistics for httperf can be seen in the following graphs -
The detailed lockmeter outputs are - 1 CPU, 2 CPUs, 4 CPUs, 6 CPUs and 8 CPUs.
The httperf results (replies/sec for a fixed connection rate) showed a statistically insignificant difference between the base 2.4.16 and lazy dcache kernels.
All dcache patches need the RCU patch for the corresponding kernel version. For example, for dcache_rcu-2.5.3-x.patch, first apply rcu-2.5.3-x.patch or rcu_poll-2.5.3-x.patch. RCU and dcache patches can be downloaded from the Read-Copy Update package in LSE.