Here are tips on manipulating some of Linux's performance-tuning parameters to make the system scream.
I/O Tuning
You'll face two types of performance bottlenecks when designing a system: input/output and program execution. We'll concentrate on I/O performance because it is the area most commonly addressed by IT managers and administrators. Most improvements in the arena of program execution must be discovered and implemented by the programmer, not the IT department. (It is useful, however, to know when your CPU is taxed.)
If you're running Linux as a Web server or a file server, your tuning priorities are going to be different than if you're running it as a desktop. On a server, you tend to maximize speed and responsiveness at the cost of "space" (RAM and disk space). While the Linux desktop market does exist, maximizing Linux's use as a server is our goal here.
Network Settings
Most tunable values in the Linux kernel are found under the /proc directory of your file system. In the domain of network tuning, there are many tunables, and you should leave most of them alone. In /proc/sys/net, especially the subdirectories ipv4 and core, there are a number of files that, when read, reveal current kernel networking values and, when written to, let you change those networking values. (These files are poorly documented in the file Documentation/networking/ip-sysctl.txt in the Linux kernel source-code distribution.) If you pose the question, "How do I tune Linux's TCP/IP network stack?" to the Linux kernel development mailing list, you'll be told you should leave it alone. The default values are intended to provide the best performance in as many cases as possible. They are tunable to provide support for odd cases when dealing with buggy or nonstandard network nodes or to allow for experimentation.
There is one value you should consider disabling: "TCP Timestamps" (exposed by the kernel as /proc/sys/net/ipv4/tcp_timestamps). According to the TCP/IP specification, time-stamping is optional, so turning it off will not break interoperability. Time stamps are intended to provide round-trip timing of packets to enable congestion control algorithms. They aren't needed if the majority of your network connections come from high-speed, noncongested LANs. Turning them off reduces the work your Linux node does per packet and slightly reduces packet size. The truth is, however, that you probably will not perceive benefits from doing this unless your system is highly loaded, at which point you probably should consider upgrading your hardware anyway.
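As a minimal sketch of the step above, the function below reads the current time-stamp setting and writes 0 to it. SYSCTL_ROOT is a hypothetical override introduced only for this example, so the function can be tried against a scratch directory; on a live system it stays at /proc/sys, the write needs root, and the change does not survive a reboot.

```shell
#!/bin/sh
# Sketch: inspect and disable TCP timestamps through the /proc interface.
# SYSCTL_ROOT is a hypothetical override for this example; the real
# location is /proc/sys.
SYSCTL_ROOT="${SYSCTL_ROOT:-/proc/sys}"

disable_tcp_timestamps() {
    ts_file="$SYSCTL_ROOT/net/ipv4/tcp_timestamps"
    if [ ! -w "$ts_file" ]; then
        echo "cannot write $ts_file (missing file, or not root?)" >&2
        return 1
    fi
    echo "before: $(cat "$ts_file")"
    echo 0 > "$ts_file"            # 0 turns time stamps off
    echo "after:  $(cat "$ts_file")"
}
```

Running disable_tcp_timestamps as root prints the value before and after the write, which makes it easy to confirm the kernel accepted the change.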
More useful is the option to "Allow Large Windows." This controls how big the send and receive buffers will be on TCP/IP socket connections before the other end must acknowledge receipt. On long-distance high-speed links (where bandwidth is large, but so is delay), this control lets you take advantage of your high-capacity network. You should have at least 16 MB of RAM to do this. To determine if a Linux kernel is enabled with large windows, simply read the file /proc/sys/net/core/rmem_max (or /proc/sys/net/core/wmem_max). If the value is 65535, it is enabled. Otherwise, you can enable it by writing a new value into that file: echo 65535 > /proc/sys/net/core/rmem_max. (Do the same for /proc/sys/net/core/wmem_max, /proc/sys/net/core/rmem_default and /proc/sys/net/core/wmem_default.)
The value 65535 is a suggestion. As you increase it (the maximum is 2^31 - 1), you run the risk of having to retransmit too much data in the event a packet gets lost. (The window also may consume more system RAM, as it could require at least two chunks of RAM that size for each network connection to your machine.) As you decrease it, you cannot take full advantage of your long-distance high-speed network.
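The trade-off above can be made concrete with a rough sketch. The first function estimates how many bytes are in flight on a link (bandwidth times round-trip delay), which is a reasonable starting point for a buffer size; the second writes one size into the four files named earlier. The function names and the SYSCTL_ROOT override are assumptions made for this example, not part of any standard tool.

```shell
#!/bin/sh
# SYSCTL_ROOT is a hypothetical override for this example; the real
# location is /proc/sys, and writing there needs root.
SYSCTL_ROOT="${SYSCTL_ROOT:-/proc/sys}"

# Bandwidth-delay product in bytes.
# Arguments: link speed in Mbit/s, round-trip time in milliseconds.
bdp_bytes() {
    echo $(( $1 * $2 * 125 ))      # Mbit/s * ms * (1,000,000 / 8 / 1000)
}

# Write one buffer size into the four socket-buffer files.
set_socket_buffers() {
    size="$1"
    for name in rmem_max wmem_max rmem_default wmem_default; do
        target="$SYSCTL_ROOT/net/core/$name"
        echo "$size" > "$target" || return 1
        echo "$name = $(cat "$target")"
    done
}
```

For instance, bdp_bytes 10 100 reports 125000 for a 10-Mbps link with a 100-millisecond round trip, roughly twice the suggested 65535, which is why long, fast links benefit from larger windows. set_socket_buffers 65535 applies the article's suggested value.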
Disk I/O
The disk subsystem of the Linux kernel is very conservative with your data. When not completely sure if a certain disk or controller reliably handles a certain setting (such as using Ultra DMA or IDE Block Mode to transfer more sectors on a single interrupt or 32-bit bus transfers), it will default to the setting least likely to cause data corruption. Using a tool called hdparm, you can change the settings and evaluate their robustness before using them in a production environment. (Actually, hdparm primarily affects IDE systems. It affects only a very limited number of items on SCSI disks.) On our test computer, by enabling Ultra DMA (using the command hdparm -d1 /dev/hda), we sped up data-read transfers from 2.17 MB per second to 11.19 MB per second. Write transfers will see a similar increase in speed. On a significantly older machine, there was no change in speed.
The only drawback to enabling this setting is the remote possibility that very old hardware will not properly implement Ultra DMA. On our test computer, setting the use of IDE Block Mode (also known as Multiple Sector Mode) via the command hdparm -m16 /dev/hda actually reduced throughput. Enabling 32-bit bus transfers (with hdparm -c1 /dev/hda) resulted in a small increase in speed but when combined with Ultra DMA delivered no observable benefits in our tests. See the documentation for hdparm (man hdparm) for more information.
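One way to evaluate a setting, sketched below, is to time reads before and after applying it. The sketch relies on hdparm's -t option, which times buffered reads from the device; the function name and the HDPARM override are hypothetical, added so the sequence can be previewed without touching a disk. The figures quoted above are not claimed to come from this exact procedure.

```shell
#!/bin/sh
# HDPARM is a hypothetical override: set HDPARM="echo hdparm" to print
# the commands instead of running them.
HDPARM="${HDPARM:-hdparm}"

# Arguments: the hdparm flag to try (for example -d1) and the device.
try_setting() {
    flag="$1"
    dev="$2"
    $HDPARM -t "$dev" || return 1        # baseline read timing
    $HDPARM "$flag" "$dev" || return 1   # apply the setting under test
    $HDPARM -t "$dev"                    # read timing with the setting on
}
```

As root on a test machine, try_setting -d1 /dev/hda, try_setting -m16 /dev/hda and try_setting -c1 /dev/hda cover the three settings discussed here. Repeat each run a few times, since a single timing can be noisy.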
Your system may see different benefits from various applications of these three settings depending on the drive and motherboard chipset. Keep an eye on /var/adm/messages (or /var/log/messages, depending on your distribution) for the message "DMA disabled" or something similar. This is a sign the IDE driver is having problems using Ultra DMA on a drive and has disabled it for the safety of your data. You may need to obtain another hard drive if you need the increased throughput.
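A quick check for that warning might look like the sketch below. The exact wording of the kernel message varies, so the case-insensitive match on "dma disabled" is an assumption, as is the function name; pass whichever log file your distribution uses.

```shell
#!/bin/sh
# Count log lines that mention DMA being disabled.
# Argument: log file to scan (defaults to the path named in the text).
dma_disabled_count() {
    log="${1:-/var/adm/messages}"
    grep -i -c 'dma disabled' "$log"
}
```

A nonzero count after enabling Ultra DMA suggests the driver has backed off; grep -i 'dma disabled' on the same file shows the lines themselves, including which drive was affected.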
Also by using hdparm (specifically the -a option), you can alter the read-ahead buffering. Under the default setting, 8 KB of data in a file is read as soon as a process reads from a file, which is good on systems where files are read in their entirety. But on systems where random parts of a file may be read, such as in a database application reading records from different locations in a file, lowering this value may be helpful because such read aheads are counterproductive.
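The -a option takes its value as a count of sectors rather than kilobytes, so a small conversion helps. The sketch below assumes 512-byte sectors; the function names and the HDPARM override are hypothetical and exist only for this example.

```shell
#!/bin/sh
# HDPARM is a hypothetical override: set HDPARM="echo hdparm" to print
# the command instead of running it.
HDPARM="${HDPARM:-hdparm}"

# Convert kilobytes to 512-byte sectors (assumes 512-byte sectors).
kb_to_sectors() {
    echo $(( $1 * 1024 / 512 ))
}

# Arguments: desired read-ahead in KB, then the device.
set_readahead_kb() {
    $HDPARM -a "$(kb_to_sectors "$1")" "$2"
}
```

Under that assumption the 8-KB default corresponds to 16 sectors, and set_readahead_kb 2 /dev/hda would run hdparm -a 4 /dev/hda. Measure the database workload before and after, because the best value depends on the access pattern.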
Consider your partitioning scheme as you set up a new Linux box. According to the Linux file-system standard, which is based on Unix practice, applications will write out their logs to /var/log, so put /var/log or /var on a disk separate from your applications, especially applications that are disk-intensive, such as databases. It is also beneficial to place these separate disks on separate disk controllers, especially if they are IDE disks. (These log hints are true for any operating system.)
You can increase your file-system buffer cache simply by adding more RAM to the system. Unused RAM is wasted RAM, and Linux will use available RAM as a cache for file-system reads and writes. This causes people to complain all their RAM is full when their systems aren't doing anything. That RAM is being used for buffers; if the RAM is needed for an application, the buffers will be flushed and RAM will be used for the application.
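To see how much of that "full" RAM is really cache, you can add up a few fields from /proc/meminfo. The sketch below sums MemFree, Buffers and Cached as a rough estimate of memory that applications could claim; the function name is hypothetical, and the figure is only approximate because not every cached page can be dropped immediately.

```shell
#!/bin/sh
# Rough estimate, in kB, of memory that is free or used only as cache.
# Argument: meminfo file to read (defaults to /proc/meminfo).
reclaimable_kb() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^(MemFree|Buffers|Cached):/ { sum += $2 }
         END { print sum + 0 }' "$meminfo"
}
```

Comparing the output of reclaimable_kb with the MemTotal line in the same file gives a more realistic picture of memory pressure than the free figure alone.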
On the subject of RAM, we should address swap space. Swap space is disk space reserved for writing out pages of memory to disk in the event the memory is needed for other purposes. RAM used by one application is temporarily swapped out to free up space for another active application. Ideally, a system should never swap, because swapping a page back into real RAM takes time. Linux does support swapping to a file, though it won't grow dynamically. Using a file for swap is not recommended, because of the overhead of having to go through the file-system code to access swap. Using a dedicated partition is more efficient. Using two or more partitions is even better, because the kernel will load-balance between them.
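As a sketch of the multi-partition setup, the helper below prints /etc/fstab entries that give every swap partition the same priority. The equal pri= value is the key assumption: the kernel spreads pages across swap areas that share a priority, whereas areas with different priorities are filled one after another. The function name and the device names in the usage note are hypothetical.

```shell
#!/bin/sh
# Print fstab lines for swap partitions that share one priority.
# Arguments: priority, then one or more swap devices.
swap_fstab_lines() {
    prio="$1"
    shift
    for dev in "$@"; do
        printf '%s none swap sw,pri=%s 0 0\n' "$dev" "$prio"
    done
}
```

For example, swap_fstab_lines 1 /dev/hda2 /dev/hdc2 prints two entries ready to paste into /etc/fstab. Placing the partitions on different disks is what makes the balancing worthwhile, and /proc/swaps shows the priorities actually in effect.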
Between the physical disk and application is the file system. The default Linux file system is called the Second Extended File System (ext2fs). When creating a file system on a disk, you can minimize the total read time by increasing the block size of a file system. (In file systems, the block size is the minimum unit that can be allocated for a file.) Doing this requires fewer total reads to read in a file. On ext2, the default block size is 1,024 bytes, but the maximum block size is 4,096 bytes. Maxing out block size has the drawback of increasing internal fragmentation. All files use at least one block. If a file is smaller than 4,096 bytes, it will still get one block, thus using 4,096 bytes minimum. Disk space tends to be cheap, so put in a larger disk and use maximum block size unless you know your average file size is significantly less than 4,096 bytes.
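The space cost of a larger block size is simple arithmetic, sketched below: a file's data is rounded up to a whole number of blocks. The function name is hypothetical, and the calculation ignores inodes and indirect blocks.

```shell
#!/bin/sh
# Bytes of data blocks consumed by a file once rounded up to whole blocks.
# Arguments: file size in bytes, block size in bytes.
allocated_bytes() {
    size="$1"
    bs="$2"
    echo $(( (size + bs - 1) / bs * bs ))
}
```

A 100-byte file occupies 1,024 bytes with 1,024-byte blocks but 4,096 bytes with 4,096-byte blocks, while a 5,000-byte file occupies 5,120 versus 8,192 bytes. The block size itself is chosen when the file system is created, for example with mke2fs -b 4096 followed by the partition name; this destroys any data already on that partition.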
Another small performance tip at the file-system level is to disable the access-time (atime) metadata that is written to disk each time a file is accessed. This data usually isn't needed, and preventing access-time writes on a large set of files being served via a Web or another file server is worthwhile. You can prevent access-time writes on an entire file system by setting the noatime option in /etc/fstab, or on particular files or directories with the chattr command.
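The fstab change can be previewed with a small filter such as the sketch below, which appends noatime to the options column of one mount point and prints the result without modifying the file. The function name is hypothetical, and the filter assumes the usual whitespace-separated fstab layout.

```shell
#!/bin/sh
# Print an fstab with noatime added to one mount point's options.
# Arguments: mount point, then the fstab to read (default /etc/fstab).
add_noatime() {
    mount_point="$1"
    fstab="${2:-/etc/fstab}"
    awk -v mp="$mount_point" '
        # Skip comments; leave entries that already carry noatime alone.
        $1 !~ /^#/ && $2 == mp && index("," $4 ",", ",noatime,") == 0 {
            $4 = $4 ",noatime"
        }
        { print }
    ' "$fstab"
}
```

Review the output of add_noatime /home before copying it over /etc/fstab. To apply the option without a reboot, mount -o remount,noatime /home works on a mounted file system, and chattr +A sets the same behavior on individual files or directories.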
When using the NFS (Network File System) client on a Linux box, pay attention to the read and write block sizes that the client uses to communicate with the NFS server. A common recommendation is that when communicating with a Sun Microsystems Solaris-based NFS server, the Linux client should be configured to use 8,192-byte block sizes. In /etc/fstab, set column four (the mount options) of the relevant entry to rsize=8192,wsize=8192.
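For illustration, the helper below prints an /etc/fstab entry with those options in column four. The function name, the server name and the paths in the usage note are hypothetical; the block size defaults to the 8,192 bytes suggested above and can be overridden when experimenting.

```shell
#!/bin/sh
# Print an fstab entry for an NFS mount with explicit block sizes.
# Arguments: server:/export, local mount point, optional block size.
nfs_fstab_line() {
    export_path="$1"
    mount_point="$2"
    size="${3:-8192}"
    printf '%s %s nfs rsize=%s,wsize=%s 0 0\n' \
        "$export_path" "$mount_point" "$size" "$size"
}
```

Calling nfs_fstab_line sun1:/export/home /mnt/home prints a line that can be added to /etc/fstab. The same options can be tested first on a one-off mount with mount -t nfs -o rsize=8192,wsize=8192 before making them permanent.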
Other factors, such as network-card idiosyncrasies, typical network load and other permutations of client and server software, may warrant other read and write block sizes. See the reference information in the glossary for a tool called nfspmon to analyze NFS setups.
Linux for Software RAID
RAID is an option to increase performance (Level 0) or performance and redundancy (Level 5 or others). Hardware solutions are the simplest but can be expensive. Linux gives you the option of using software RAID. No special hardware is needed besides sufficient disks and disk controllers. Doing RAID 0 in software is not as fast as doing it in hardware but is faster than using each disk as a separate file system, because reads and writes are multiplexed between all the disks. Doing RAID 5 or RAID 10 also buys you redundancy. This can equate to an increase in performance when you consider downtime as zero bandwidth.
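As a rough sketch of setting up a software RAID 0 array, the function below wraps mdadm, the array management tool found on current Linux distributions. Using mdadm is an assumption here, since the text does not name a tool; the function name, the MDADM override and the device names in the usage note are likewise hypothetical.

```shell
#!/bin/sh
# MDADM is a hypothetical override: set MDADM="echo mdadm" to print the
# command instead of running it.
MDADM="${MDADM:-mdadm}"

# Arguments: the array device to create, then two or more member devices.
create_raid0() {
    md="$1"
    shift
    if [ "$#" -lt 2 ]; then
        echo "need at least two member devices" >&2
        return 1
    fi
    $MDADM --create "$md" --level=0 --raid-devices="$#" "$@"
}
```

With the override set, create_raid0 /dev/md0 /dev/sdb1 /dev/sdc1 prints the mdadm command for inspection; without it, the command runs for real, needs root, and destroys existing data on the member partitions. Once created, the array gets a file system and is mounted like any other disk.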
If you stumble across a situation in which you do not think you are getting the performance you could be, and you are fairly certain that you have configured everything to maximize efficiency, characterize the problem, including all relevant hardware and network configurations, and send a message to the kernel developer mailing list. Chances are someone will respond with fixes to improve the situation.
One of the goals of Linux kernel development is to make workshops such as this one irrelevant. That is, Linus Torvalds and the programmers who develop Linux intend to deliver a kernel that is preconfigured for maximum performance. No tweaks should be necessary. They intend to do this by experimenting with self-adjusting algorithms capable of detecting and rectifying bottlenecks. Extensive work is being done to improve the memory-management subsystem. On high-performance file systems, special Web and file servers can service requests more swiftly by bypassing many of the layers that exist between a requesting client and the requested files.
Performance From Open Source
As is always true, throwing more money at a problem, in the form of buying more or better hardware, can "solve" it. This technique does not remove the bottleneck; it simply widens all the paths, so the bottleneck is no longer felt. But there is another way to throw money at Linux performance problems. Because of its open-source licensing, Linux can be modified by anyone, so you're free to change it as you please. Therefore, you also have the option of hiring programmers--even Linux kernel programmers--to implement your performance solution. Or do it yourself.
Jeremy Impson is an associate network engineer at the Center for Nomadic Computing and Mobile Networking at Lockheed Martin Systems Integration in Owego, NY.
