BKL Removal Patches for 2.4.17

These patches improve SMP Linux performance by reducing where, and for how long, the Big Kernel Lock (BKL) is held. Many of them were accepted for use in 2.5 but not 2.4. The target for these patches is 2.4.17.

I have several other BKL patches which are oriented more toward cleanup than performance; they are not covered here.

Stability Note:
I'm confident about all of these except the notify_change patch. It has needed some fixes in 2.5 and has had other issues there, so it probably needs a bit more time in 2.5 to get the bugs out.



Rollup patch - download

This patch combines all those listed below into a single patch.

Lockmeter is a tool written by SGI to evaluate spinlock and rwlock performance in the Linux kernel. It records data describing lock contention and hold times. I have recorded the differences in performance of 2.4.17 running dbench with and without the rollup patch.

These data have been trimmed down a little bit, to save space. The complete data are available here, under Global Lock Reduction -> Lockmeter Data.

The original 2.4.17 had horrible BKL hold times in do_exit(): a 7711us mean and a 55ms maximum!! That is a very long time.

SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME
 0.76% 22.8% 7711us(  55ms) 2363us(  37ms)(0.01%)       193 77.2% 22.8%    0%    do_exit+0xd8

Look at the difference with the rollup applied. First of all, do_exit no longer takes the BKL directly, because the lock was pushed down into sem_exit and disassociate_tty, and disassociate_tty didn't even get called. Here, we traded a 7711us average hold time for a 5.5us maximum hold time.

SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME
 0.00%  9.8%  1.6us( 5.5us)   78us( 387us)(0.00%)       193 90.2%  9.8%    0%    sem_exit+0x20


BKL Lockmeter data for 2.4.17, no patches applied:
SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME
 27.2% 32.6%  6.0us(  55ms)   39us(  42ms)( 7.1%)   8823974 67.4% 32.6%    0%  kernel_flag
 0.76% 22.8% 7711us(  55ms) 2363us(  37ms)(0.01%)       193 77.2% 22.8%    0%    do_exit+0xd8 **
  2.5% 14.9%   32us(1510us)   21us(6383us)(0.03%)    153921 85.1% 14.9%    0%    ext2_delete_inode+0x20
 0.52% 37.7%  1.3us( 345us)   31us(8684us)(0.57%)    756626 62.3% 37.7%    0%    ext2_discard_prealloc+0x20
  8.4% 35.8%  8.4us( 937us)   40us(  23ms)( 1.8%)   1939931 64.2% 35.8%    0%    ext2_get_block+0x54
  1.4% 36.8%   37us( 706us)   52us(  32ms)(0.10%)     77244 63.2% 36.8%    0%    lookup_hash+0x7c
 0.08% 18.7%  7.6us( 383us)   18us( 662us)(0.00%)     20480 81.3% 18.7%    0%    notify_change+0x50
 0.01% 30.3%  0.6us(  62us)   36us(4789us)(0.01%)     20480 69.7% 30.3%    0%    open_namei+0x444
 0.26% 45.5%  0.9us( 385us)   30us(7313us)(0.49%)    559692 54.5% 45.5%    0%    open_namei+0x4dc
 0.10% 32.5%   23us( 513us)   45us(4047us)(0.01%)      8881 67.5% 32.5%    0%    real_lookup+0x64
 0.06% 63.0%  112us(  28ms)   38us( 487us)(0.00%)      1072 37.0% 63.0%    0%    schedule+0x508
  2.3% 30.0%  1.0us( 432us)   41us(  14ms)( 3.5%)   4477600 70.0% 30.0%    0%    sys_lseek+0x70
  1.3% 31.9%   84us(1109us)   39us(6229us)(0.02%)     31200 68.1% 31.9%    0%    sys_rename+0x19c
 0.07% 42.7%  0.9us( 444us)   31us(6619us)(0.13%)    153770 57.3% 42.7%    0%    vfs_create+0x9c
  4.9% 16.4%   62us(1026us)   22us(5356us)(0.04%)    153770 83.6% 16.4%    0%    vfs_create+0xf8
 0.00% 79.8%  2.1us(  59us)   39us( 455us)(0.00%)      1772 20.2% 79.8%    0%    vfs_mkdir+0x94
 0.11% 28.8%  119us( 500us)   48us( 806us)(0.00%)      1772 71.2% 28.8%    0%    vfs_mkdir+0xf4
 0.00% 75.0%   34us(  84us)   20us(  32us)(0.00%)         4 25.0% 75.0%    0%    vfs_readdir+0x68
 0.00% 33.8%  0.6us(  49us)  381us(  42ms)(0.02%)      1919 66.2% 33.8%    0%    vfs_rmdir+0x118
 0.03% 20.5%   28us( 850us)   59us(2681us)(0.00%)      1919 79.5% 20.5%    0%    vfs_rmdir+0x1c4
 0.00% 33.3%  7.7us( 7.9us)  8.0us( 8.0us)(0.00%)         3 66.7% 33.3%    0%    vfs_statfs+0x4c
 0.05% 32.1%  0.7us( 241us)   36us(8798us)(0.11%)    153761 67.9% 32.1%    0%    vfs_unlink+0x114
  4.0% 12.3%   51us( 905us)   20us(8115us)(0.02%)    153761 87.7% 12.3%    0%    vfs_unlink+0x17c

BKL Lockmeter data for 2.4.17, BKL removal rollup applied:
SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME
 17.2%  8.2%  7.2us(  10ms)   41us(9988us)( 1.0%)   4557319 91.8%  8.2%    0%  kernel_flag
 0.00% 50.0%  1.6us( 2.1us)   59us(  59us)(0.00%)         2 50.0% 50.0%    0%    do_fcntl+0x18c
  3.0%  9.0%   36us(  10ms)   33us(3814us)(0.03%)    155532 91.0%  9.0%    0%    ext2_delete_inode+0x80 ***
 0.05%  4.8%  0.3us(  93us)   33us(9729us)(0.04%)    357232 95.2%  4.8%    0%    ext2_free_blocks+0x234
 0.09% 11.2%  0.5us( 164us)   50us(9643us)(0.12%)    324391 88.8% 11.2%    0%    ext2_new_block+0x528
 0.20%  1.6%  0.2us( 159us)   15us(6813us)(0.03%)   2063114 98.4%  1.6%    0%    ext2_new_block+0x7a4
 0.01% 0.76%  0.2us(  68us)   12us( 304us)(0.00%)     62865 99.2% 0.76%    0%    ext2_new_block+0x884
 0.01%  8.4%  0.8us(  26us)   19us(5462us)(0.00%)     20480 91.6%  8.4%    0%    inode_change_ok+0x34
 0.26%  4.1%   24us( 376us)   14us( 499us)(0.00%)     20480 95.9%  4.1%    0%    inode_setattr+0x34
  1.1% 17.4%   27us( 666us)   59us(7791us)(0.05%)     77515 82.6% 17.4%    0%    lookup_hash+0x7c
 0.00% 18.9%  0.4us(  48us)   49us(5544us)(0.01%)     20480 81.1% 18.9%    0%    open_namei+0x444
 0.11% 22.3%  0.4us( 147us)   43us(9988us)(0.35%)    559693 77.7% 22.3%    0%    open_namei+0x4dc
 0.09% 16.0%   19us(1454us)   44us(1185us)(0.00%)      8833 84.0% 16.0%    0%    real_lookup+0x64
  1.7% 26.3%   41us(4883us)   44us(9216us)(0.06%)     77152 73.7% 26.3%    0%    schedule+0x508
 0.00%  9.8%  1.6us( 5.5us)   78us( 387us)(0.00%)       193 90.2%  9.8%    0%    sem_exit+0x20
 0.09%  8.1% 4598us(9648us)  105us( 231us)(0.00%)        37 91.9%  8.1%    0%    sync_old_buffers+0x1c
  1.7% 16.1%  104us(1865us)   56us(5423us)(0.02%)     31201 83.9% 16.1%    0%    sys_rename+0x19c
 0.03% 19.2%  0.4us( 144us)   42us(9869us)(0.08%)    153762 80.8% 19.2%    0%    vfs_create+0x9c
  4.9%  4.8%   61us(1554us)   16us(1249us)(0.01%)    153762 95.2%  4.8%    0%    vfs_create+0xf8
 0.00% 28.8%  0.8us(  33us)   35us( 661us)(0.00%)      1770 71.2% 28.8%    0%    vfs_mkdir+0x94
 0.13% 10.5%  137us( 784us)   34us( 391us)(0.00%)      1770 89.5% 10.5%    0%    vfs_mkdir+0xf4
 0.00% 16.4%  0.4us(  22us)   76us( 915us)(0.00%)      1920 83.6% 16.4%    0%    vfs_rmdir+0x118
 0.03%  8.9%   29us( 630us)   53us( 423us)(0.00%)      1920 91.1%  8.9%    0%    vfs_rmdir+0x1c4
 0.03% 18.2%  0.4us( 209us)   51us(9652us)(0.09%)    153761 81.8% 18.2%    0%    vfs_unlink+0x114
  3.7%  4.1%   46us(1244us)   15us( 526us)(0.01%)    153761 95.9%  4.1%    0%    vfs_unlink+0x17c
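
The kernel_flag summary rows of the two tables can be compared directly. The sketch below only does the arithmetic on the numbers printed above; the struct and function names are made up for illustration, and comparing TOTAL assumes both dbench runs did about the same amount of work. With these inputs, contention drops by roughly 75% and the number of BKL acquisitions by roughly 48%.

```c
#include <assert.h>

/* One kernel_flag summary row as printed by Lockmeter (illustrative struct). */
struct bkl_summary {
    double util_pct;      /* UTIL column */
    double con_pct;       /* CON column */
    double wait_cpu_pct;  /* WAIT (% CPU) column */
    double total;         /* TOTAL acquisitions */
};

/* Values copied from the kernel_flag rows of the two tables above. */
static const struct bkl_summary stock  = { 27.2, 32.6, 7.1, 8823974.0 };
static const struct bkl_summary rollup = { 17.2,  8.2, 1.0, 4557319.0 };

/* Percentage by which `after` is smaller than `before`. */
double pct_reduction(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```

For example, pct_reduction(stock.con_pct, rollup.con_pct) gives the relative drop in contention between the two runs.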


Push the BKL out of do_exit() - download

do_exit() is called to tidy up after a process during its exit. The BKL was held through quite a few function calls where it was not needed. The solution is to move the BKL into the functions which actually need it, and completely remove it from do_exit(). The patch reduces BKL hold times in do_exit() by a factor of 100. The patch announcement.
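
The general shape of this kind of lock pushdown can be sketched in userspace C. This is not the patch itself: the spinlock below is only a stand-in for the BKL, and every function name is hypothetical. The bookkeeping counters are only meaningful in a single-threaded demo.

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace stand-in for the BKL (hypothetical; the kernel itself uses
 * lock_kernel()/unlock_kernel() on kernel_flag). */
static atomic_flag big_lock = ATOMIC_FLAG_INIT;
static int big_lock_held;   /* demo bookkeeping, single-threaded only */
static int unlocked_steps;  /* teardown steps that ran without the lock */

static void big_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&big_lock, memory_order_acquire))
        ;   /* spin until the current holder releases */
    big_lock_held = 1;
}

static void big_lock_release(void)
{
    big_lock_held = 0;
    atomic_flag_clear_explicit(&big_lock, memory_order_release);
}

/* Teardown work that never needed the global lock. */
static void release_private_state(void)
{
    if (!big_lock_held)
        unlocked_steps++;
}

/* Teardown work that does need it: the lock now lives in here. */
static void release_shared_state(int *shared_refs)
{
    big_lock_acquire();
    assert(big_lock_held);
    (*shared_refs)--;
    big_lock_release();
}

/* Before: the caller holds the lock across every step. */
void exit_before(int *shared_refs)
{
    big_lock_acquire();
    release_private_state();
    (*shared_refs)--;
    release_private_state();
    big_lock_release();
}

/* After: the caller takes no lock; only the callee that needs it does. */
void exit_after(int *shared_refs)
{
    release_private_state();
    release_shared_state(shared_refs);
    release_private_state();
}
```

Both variants do the same work, but in exit_after() the lock is held only for the single decrement, which is the effect the hold-time numbers above reflect.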

Do not combine this with the 2.4 preemption patch. The kernel will not boot without the preemption fixes covered in this thread.



Remove the BKL from ext2_get_block() - download

This is an ext2 BKL reduction for 2.4. It is very important for workloads which do a lot of ext2 filesystem I/O on different files. As my announcement email states, this is a backport of a patch which Al Viro produced for 2.5. It creates a new lock, i_meta_lock, to replace the BKL.
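
Why a per-inode lock helps when the I/O is spread over different files can be shown with a small userspace sketch. It assumes a simple spinlock per inode; the actual type and placement of i_meta_lock in the patch may differ, and all names prefixed with demo_ are invented here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative inode with its own metadata lock instead of a global one. */
struct demo_inode {
    atomic_flag meta_lock;  /* stand-in for a per-inode i_meta_lock */
    unsigned long blocks;   /* metadata protected by meta_lock */
};

void demo_inode_init(struct demo_inode *inode)
{
    atomic_flag_clear(&inode->meta_lock);
    inode->blocks = 0;
}

/* Returns 1 if the lock was taken, 0 if someone already holds it. */
int demo_meta_trylock(struct demo_inode *inode)
{
    return !atomic_flag_test_and_set_explicit(&inode->meta_lock,
                                              memory_order_acquire);
}

void demo_meta_lock(struct demo_inode *inode)
{
    while (!demo_meta_trylock(inode))
        ;   /* spin: only users of this same inode contend here */
}

void demo_meta_unlock(struct demo_inode *inode)
{
    atomic_flag_clear_explicit(&inode->meta_lock, memory_order_release);
}

/* Allocate one more block for this inode; returns the new block count. */
unsigned long demo_map_block(struct demo_inode *inode)
{
    unsigned long n;

    assert(inode != NULL);
    demo_meta_lock(inode);
    n = ++inode->blocks;
    demo_meta_unlock(inode);
    return n;
}
```

With a single global lock, a holder working on one file stalls block mapping on every other file. Here, holding the lock of one inode leaves demo_map_block() on a different inode free to proceed.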



Push BKL out of notify_change() - download

This patch shifts the BKL out of notify_change and into the individual filesystems. I initially used an all-new semaphore. However, Linus suggested that I use i_sem. I did this, but Al Viro found a problem with this approach. A patch was eventually settled on.
A few days later, a bug was found in Al's implementation, but it was fixed by Andrew Morton.



Push the BKL into individual llseek() functions - download

This is Robert Love's patch to shift the BKL out of llseek in 2.4. A similar patch was accepted for 2.5. Robert's most up-to-date patches can be found in his kernel.org people directory. The patch I provide here is a clean backport of his 2.4.19-pre2 patch.
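
As a rough userspace illustration of what "pushing the lock into individual llseek() functions" means, the sketch below moves the lock out of the dispatching call and into the default seek method. It is not Robert's patch: the spinlock stands in for the BKL, the demo_ and private_ names are hypothetical, and the acquisition counter is only valid single-threaded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_flag big_lock = ATOMIC_FLAG_INIT;  /* stand-in for the BKL */
static unsigned long big_lock_acquisitions;      /* demo counter */

static void big_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&big_lock, memory_order_acquire))
        ;   /* spin until free */
    big_lock_acquisitions++;
}

static void big_lock_release(void)
{
    atomic_flag_clear_explicit(&big_lock, memory_order_release);
}

struct demo_file;
typedef long (*llseek_fn)(struct demo_file *file, long offset);

struct demo_file {
    long pos;
    llseek_fn llseek;   /* per-filesystem seek method, may be NULL */
};

/* Default method: still uses the global lock, but takes it itself. */
long default_llseek(struct demo_file *file, long offset)
{
    long ret = -1;

    big_lock_acquire();
    if (offset >= 0) {
        file->pos = offset;
        ret = offset;
    }
    big_lock_release();
    return ret;
}

/* Hypothetical filesystem method assumed to need no global lock.
 * A real one would still have to serialize updates to file->pos. */
long private_llseek(struct demo_file *file, long offset)
{
    if (offset < 0)
        return -1;
    file->pos = offset;
    return offset;
}

/* Dispatcher: no lock is taken at this level any more. */
long demo_sys_lseek(struct demo_file *file, long offset)
{
    llseek_fn fn;

    assert(file != NULL);
    fn = file->llseek ? file->llseek : default_llseek;
    return fn(file, offset);
}
```

Once the lock lives in the methods rather than the dispatcher, each filesystem can be converted on its own schedule. In the unpatched Lockmeter data above, sys_lseek+0x70 is the single most frequent BKL call site, and it is absent from the patched table.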



Written By:
Dave Hansen
haveblue@us.ibm.com