STATUS

NetBench Performance Evaluation for Linux






The chart above summarizes recent performance improvements for Samba on Linux. The tests were run on an IBM xSeries x440 server with four 1.5 GHz P4 processors, four gigabit Ethernet adapters, 2 GB of memory, and fourteen 15k rpm SCSI disks. SuSE 8.0 was used for all tests. Each subsequent test adds one new change, which can be a configuration change, a kernel change, or a Samba change. NetBench's Enterprise Disk Suite, by Ziff Davis, was used for all tests.

Baseline: A clean installation of SuSE 8.0 with no performance-related configuration changes.

Data=writeback: We changed the ext3 journaling mode for the /data filesystem (where the Samba shares reside) from the default "ordered" to "writeback". This greatly improves filesystem performance on metadata-intensive workloads such as this one.
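One way to confirm which journaling mode a mount entry requests is to look at its options field. The sketch below parses an fstab-style line. The device name /dev/sdb1 and both sample lines are invented for illustration, and the function assumes that an ext3 entry without a data= option uses the "ordered" default mentioned above.

```python
def ext3_data_mode(mount_line):
    """Return the ext3 journaling mode named in an fstab-style line."""
    fields = mount_line.split()
    # Field layout: device, mount point, filesystem type, options, ...
    options = fields[3].split(",")
    for opt in options:
        if opt.startswith("data="):
            return opt[len("data="):]
    return "ordered"  # assumed default when no data= option is present


# Hypothetical entries for the /data filesystem that holds the shares.
before = "/dev/sdb1 /data ext3 defaults 1 2"
after = "/dev/sdb1 /data ext3 defaults,data=writeback 1 2"

print(ext3_data_mode(before))  # ordered
print(ext3_data_mode(after))   # writeback
```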

Smblog=1: The Samba log level was lowered from 2 to 1 to reduce disk I/O to the Samba log files. Level 1 is still verbose enough to log critical errors.
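As a rough illustration, the setting can be checked by reading the [global] section of smb.conf. The sketch below uses Python's configparser on a made-up, minimal configuration; a real smb.conf may contain constructs that configparser does not handle, so treat this as a convenience check rather than a faithful smb.conf parser.

```python
import configparser

# Minimal, made-up smb.conf fragment with the logging level set to 1.
SAMPLE_SMB_CONF = """
[global]
log level = 1
"""


def samba_log_level(conf_text):
    """Return the integer 'log level' from the [global] section."""
    # interpolation=None keeps '%' substitutions in other values from
    # confusing the parser; strict=False tolerates repeated keys.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.getint("global", "log level")


print(samba_log_level(SAMPLE_SMB_CONF))  # 1
```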

Sendfile/zerocopy: A patch by Anton Blanchard makes Samba use sendfile() for client read requests. Combined with Linux zerocopy support (first available in 2.4.4), this eliminates two very costly memory copies.
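The difference between the two read paths can be sketched as follows. This is not the Samba patch itself, only an illustration of the idea: the first function copies file data into a user-space buffer and back out to the socket, while the second asks the kernel to move the data directly. The function names are hypothetical, and socket.sendfile() is Python's wrapper, which uses the sendfile system call where the platform provides it.

```python
import socket


def serve_read_with_copy(sock, path, chunk=65536):
    """Conventional path: read() into a user buffer, then send() it."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)   # copy 1: kernel page cache -> user buffer
            if not buf:
                break
            sock.sendall(buf)     # copy 2: user buffer -> socket buffer
            sent += len(buf)
    return sent


def serve_read_with_sendfile(sock, path):
    """sendfile path: the kernel feeds file pages to the socket itself."""
    with open(path, "rb") as f:
        return sock.sendfile(f)   # returns the number of bytes sent
```

Both functions return the number of bytes sent, so either can be dropped into the same read handler for comparison.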

O(1) Scheduler: This gives only a small improvement by itself, but it will facilitate other performance improvements in the future.

Evenly affined IRQs: The interrupts of each of the four network adapters are handled by a different processor. On the P4 architecture, SuSE 8.0 defaults to a round-robin assignment (destination = irq_num % num_cpus) for IRQ-to-CPU mappings. In this particular case, that routed all of the network adapters' IRQs to CPU0. This can be very good for performance because it improves cache warmth for the interrupt-handling code, but we wanted to spread these IRQs evenly across the CPUs so that IRQ affinity and process affinity can work together for even greater performance.
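The arithmetic can be illustrated with a short sketch. The IRQ numbers below are hypothetical (the real ones are not given here); they are chosen as multiples of four, which is one way the modulo rule can send every adapter to CPU0. The second function computes the hexadecimal CPU bitmasks that Linux accepts in /proc/irq/<irq>/smp_affinity, and the script only prints the commands rather than writing to /proc.

```python
def round_robin_cpu(irq_num, num_cpus):
    """Default mapping quoted above: destination = irq_num % num_cpus."""
    return irq_num % num_cpus


def even_affinity_masks(irqs, num_cpus):
    """Give each IRQ its own CPU; return hex bitmasks for smp_affinity."""
    masks = {}
    for index, irq in enumerate(sorted(irqs)):
        cpu = index % num_cpus
        masks[irq] = format(1 << cpu, "x")  # bit N set means CPU N
    return masks


# Hypothetical IRQ numbers for the four adapters, all multiples of 4.
nic_irqs = [24, 28, 32, 36]

print([round_robin_cpu(irq, 4) for irq in nic_irqs])  # [0, 0, 0, 0]
for irq, mask in even_affinity_masks(nic_irqs, 4).items():
    print(f"echo {mask} > /proc/irq/{irq}/smp_affinity")
```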

Process Affinity: This technique ensures that, for each network interrupt that is processed, the corresponding smbd process is scheduled on the same CPU, which further improves cache warmth. Note: this is primarily a benchmark technique and is not commonly used elsewhere. If you can logically divide your workload evenly across many CPUs, this can be a big gain. However, most workloads in practice are dynamic, and affinity cannot be predetermined.
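A minimal sketch of pinning a process to one CPU is shown below, using the Linux-only os.sched_setaffinity() call available in current Python. It is not necessarily the mechanism used in these tests. In a setup like the one described, pid would be the smbd process serving one adapter's clients and cpu would be the CPU that handles that adapter's interrupts; the demo simply pins the current process and then restores its original CPU set.

```python
import os


def pin_to_cpu(pid, cpu):
    """Restrict a process to one CPU and return its resulting CPU set."""
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)  # CPUs this process may use now
    target = min(allowed)              # pick any permitted CPU
    print(pin_to_cpu(0, target))
    os.sched_setaffinity(0, allowed)   # restore the original CPU set
```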

/proc/sys/net/hll=763: This increases the pool of memory dedicated to socket buffers. It should reduce the number of calls to kmem_cache_grow() for sk_buffs.

Case sensitivity enforced: When case sensitivity is not enforced, Samba may have to search for different versions of a file name before it can stat that file, because the same name can be written with many different combinations of upper and lower case. Enforcing case sensitivity eliminates those guesses.
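The extra work can be illustrated with a simplified sketch. This is not Samba's actual lookup code, and the function names are hypothetical. With case sensitivity enforced, a lookup is a single check of the exact name; without it, a miss on the exact name forces a scan of the directory to find an entry that matches when case is ignored.

```python
import os


def lookup_case_sensitive(directory, name):
    """One check of the exact name; no guessing."""
    path = os.path.join(directory, name)
    return path if os.path.exists(path) else None


def lookup_case_insensitive(directory, name):
    """Try the exact name, then fall back to scanning the directory."""
    exact = lookup_case_sensitive(directory, name)
    if exact is not None:
        return exact
    wanted = name.lower()
    for entry in os.listdir(directory):  # cost grows with directory size
        if entry.lower() == wanted:
            return os.path.join(directory, entry)
    return None
```

On a large directory, the fallback scan in the second function is the cost that enforcing case sensitivity removes.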

Spinlocks: Samba uses fcntl() locking for its database, which can be costly. Using spinlocks avoids the fcntl() call and the Big Kernel Lock taken in posix_lock_file(), which reduces contention and wait times for the Big Kernel Lock. To use this feature, configure Samba with "--with-spinlocks".
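The locking pattern being avoided looks roughly like the sketch below: every update takes and then releases a byte-range lock through fcntl(), which means two system calls per update in addition to the write itself. This is a simplified illustration rather than Samba's database code, and locked_update and its parameters are hypothetical names.

```python
import fcntl
import os
import tempfile


def locked_update(fd, offset, length, data):
    """Write data while holding an exclusive fcntl byte-range lock."""
    fcntl.lockf(fd, fcntl.LOCK_EX, length, offset)      # syscall: lock
    try:
        return os.pwrite(fd, data, offset)
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN, length, offset)  # syscall: unlock


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        # Lock and update an 8-byte record at offset 16.
        print(locked_update(f.fileno(), 16, 8, b"12345678"))
```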

Dcache read-copy update: Directory entry lookup times are reduced by a new implementation of d_lookup() that uses the read-copy update (RCU) technique. For more information, see the locking section of the Linux Scalability Effort project.