Mini-HOWTO: Multi Disk System Tuning
Version 0.2  Date 960321  By Stein Gjoen

This document was written for two reasons, mainly because I got hold of 3 old SCSI disks to set up my Linux system on and I was pondering how best to utilize the inherent possibilities of parallelising in a SCSI system. Secondly I hear there is a prize for people who write docs...

This is intended to be read in conjunction with the Linux File System Standard (FSSTND). It does not in any way replace it but tries to suggest where physically to place directories detailed in FSSTND, both in terms of drives, partitions, types, RAID, file system (fs), physical sizes and other parameters that should be considered and tuned in a Linux system, ranging from single home systems to large servers.

This is also a learning experience for myself and I hope I can start the ball rolling with this Mini-HOWTO and that it perhaps can evolve into a larger, more detailed and hopefully even more correct HOWTO.

Notes in square brackets indicate where I need more information.

Note that this is a guide on how to design and map logical partitions onto multiple disks and tune for performance and reliability, NOT how to actually partition the disks or format them.

This is the first update, still without any inputs... So let's cut to the chase where swap and /tmp are racing along the hard drive...

---------------------------------------------------------------

1. Considerations

The starting point in this will be to consider where you are and what you want to do. The typical home system starts out with existing hardware and the newly converted will want to get the most out of it. Someone setting up a new system for a specific purpose (such as an Internet provider) will instead have to consider what the goal is and buy accordingly. Being ambitious I will try to cover the entire range.
Various purposes will also have different requirements regarding file system placement on the drives; a large multiuser machine would probably be best off with the /home directory on a separate disk, just to give an example. In general, for performance it is advantageous to split most things over as many disks as possible, but there is a limited number of devices that can live on a SCSI bus and cost is naturally also a factor.

1.1 File system features

The various parts of FSSTND have different requirements regarding speed, reliability and size; for instance losing root is a pain but can easily be recovered. Losing /var/spool/mail is a rather different issue. Here is a quick summary of some essential parts and their properties and requirements. [This REALLY needs some beefing up]:

1.1.1 Swap

Speed: Maximum! Though if you rely too much on swap you should consider buying some more RAM.

Size: Quick and dirty algorithm: just as for tea: 16M for the machine and 2M for each user. The smallest kernels run in 1M but that is tight; use 4M for general work and light applications, 8M for X11 or GCC, or 16M to be comfortable. [The author is known to brew rather powerful tea...]

Reliability: Medium. When it fails you know it pretty quickly and failure will cost you some lost work. You save often, don't you?

1.1.2 /tmp and /var/tmp

Speed: Very high. Putting these on a separate disk/partition will also reduce fragmentation generally, though ext2fs handles fragmentation rather well.

Size: Hard to tell; small systems are easy to run with a few megs, but these are notorious hiding places for stashing files away from prying eyes and quota enforcement, and can grow without control on larger machines. Suggested: small machine: 8M; large machines up to 500M (the machine here has 1100 users and a 300M /tmp area).

Reliability: Low. Often programs will warn or fail gracefully when these areas fail or are filled up. Random file errors will of course be more serious, no matter what file area this is.
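The 'tea' rule of thumb for swap from section 1.1.1 is easy to put in code. This is only a sketch of the rule as stated above; the function name is mine, and the numbers are the rough guesses from the text, not measured values.

```python
# Hedged sketch of the swap sizing rule from section 1.1.1:
# 16M base for the machine plus 2M per user. The function name and
# default arguments are illustrative additions, not part of the HOWTO.

def suggested_swap_mb(n_users, base_mb=16, per_user_mb=2):
    """Return a rough suggested swap size in megabytes."""
    return base_mb + per_user_mb * n_users

# A single-user workstation:
print(suggested_swap_mb(1))    # 18
# A 100-user multiuser machine:
print(suggested_swap_mb(100))  # 216
```

As with tea, taste varies: if the result looks large, more RAM is usually a better buy than more swap.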
(* That was 50 lines, I am home and dry! *)

1.1.3 Spool areas (/var/spool/news, /var/spool/mail)

Speed: High, especially on large news servers. News transfer and expiring are disk intensive and will benefit from fast drives. Print spools: low. Consider RAID0 for news.

Size: For news/mail servers: whatever you can afford. For single user systems a few megs will be sufficient if you read continuously. Joining a list server and taking a holiday is on the other hand not a good idea. (Again, the machine I use has 100M reserved for the entire /var/spool.)

Reliability: Mail: very high; news: medium; print spool: low. If your mail is very important (isn't it always?) consider RAID for reliability. [Is mail spool failure frequent? I have never experienced it but there are people catering to this market of reliability...]

Note: Some of the news documentation suggests putting all the .overview files on a drive separate from the news files; check out the news FAQs for more information.

1.1.4 Home directories (/home)

Speed: Medium. Although many programs use /var for temporary storage, others such as some newsreaders frequently update files in the home directory, which can be noticeable on large multiuser systems. For small systems this is not a critical issue.

Size: Tricky! On some systems people pay for storage, so this is then usually a question of economy. Large systems such as nyx.net (which is a free Internet service with mail, news and WWW services) run successfully with a suggested limit of 100K per user and 300K as max. If however you are writing books or doing design work the requirements balloon quickly.

Reliability: Variable. Losing /home on a single user machine is annoying, but when 2000 users call you to tell you their home directories are gone it is more than just annoying. For some, their livelihood relies on what is here. You do regular backups of course?

Note: You might consider RAID for either speed or reliability.
If you want extremely high speed and reliability you might be looking at other OSes and platforms anyway. (Fault tolerance etc.)

1.1.5 Main binaries (/usr/bin and /usr/local/bin)

Speed: Low. Often data is bigger than the programs, which are demand loaded anyway, so this is not speed critical. Witness the success of live file systems on CD-ROM.

Size: The sky is the limit, but 200M should give you most of what you want for a comprehensive system. (The machine I use, including the libraries, uses about 800M.)

Reliability: Low. This is usually mounted under root where all the essentials are collected. Nevertheless losing all the binaries is a pain...

1.1.6 Libraries (/usr/lib and /usr/local/lib)

Speed: Medium. These are large chunks of data loaded often, ranging from object files to fonts, all susceptible to bloating. Often these are also loaded in their entirety, so speed is of some use here.

Size: Variable. This is for instance where word processors store their immense font files. [Actual sizes, anyone? I'd like data for GCC related libraries, TeX/LaTeX, X11 and others that can be relevant.]

Reliability: Low. See point 1.1.5.

1.1.7 Root

Speed: Quite low: only the bare minimum is here, much of which is only run at startup time.

Size: Quite small. The biggest file is /vmlinuz; unless you have a large rescue file collection, about 4M should be sufficient.

Reliability: High. A failure here will possibly cause a lot of grief, starting with rescuing your boot partition. Naturally you do have a rescue disk?

1.2 Explanation of terms

Naturally the faster the better, but often the happy installer of Linux has several disks of varying speed and reliability, so even though this document describes performance as 'fast' and 'slow' it is just a rough guide since no finer granularity is feasible. Even so there are a few details that should be kept in mind:

1.2.1 Speed

This is really a rather woolly mix of several terms: CPU load, transfer setup overhead, disk seek time and transfer rate.
It is in the very nature of tuning that there is no fixed optimum, and in most cases price is the dictating factor. CPU load is only significant for IDE systems where the CPU does the transfer itself [more details needed here!!] but is generally low for SCSI; see the SCSI documentation for actual numbers. Disk seek time is also small, usually in the millisecond range. This however is not a problem if you use command queuing on SCSI, where you then overlap commands, keeping the bus busy all the time. News spools are a special case consisting of a huge number of normally small files, so in this case seek time can become more significant.

1.2.2 Reliability

Naturally no one wants low reliability disks, but one might be better off regarding old disks as unreliable. Also for RAID purposes (see the relevant docs) it is suggested to use a mixed set of disks so that simultaneous disk crashes become less likely.

1.3 RAID

This is a method of increasing reliability, speed or both by using multiple disks in parallel, thereby reducing access time and increasing transfer speed. A checksum or mirroring system can be used to increase reliability. Large servers can take advantage of such a setup but it might be overkill for a single user system unless you already have a large number of disks available. See other docs and FAQs for more information.

1.4 AFS, Veritas and Other Volume Management Systems

Although multiple partitions and disks have the advantage of making for more space and higher speed and reliability, there is a significant snag: if for instance the /tmp partition is full you are in trouble even if the news spool is empty, as it is not easy to retransfer quotas across disks. Volume management is a system that does just this, and AFS and Veritas are two of the best known examples. Some also offer other file systems, like log file systems and others optimised for reliability or speed.
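The striping idea behind RAID (section 1.3) can be illustrated with a toy block mapping: logical blocks are dealt out round-robin over the member disks in fixed-size chunks, so that large sequential transfers hit all spindles at once. This is a textbook sketch in Python, with names of my own choosing; it is not how the Linux md driver actually lays out data.

```python
# Toy RAID0 (striping) mapping: which disk and which block on that
# disk does a given logical block land on? Chunk size and function
# name are illustrative assumptions, not md driver internals.

def stripe_map(logical_block, n_disks, chunk_blocks=8):
    """Map a logical block number to (disk index, block on that disk)."""
    chunk = logical_block // chunk_blocks      # which chunk we are in
    offset = logical_block % chunk_blocks      # position inside the chunk
    disk = chunk % n_disks                     # chunks go round-robin
    block_on_disk = (chunk // n_disks) * chunk_blocks + offset
    return disk, block_on_disk

# With 3 disks and 8-block chunks, blocks 0-7 land on disk 0,
# blocks 8-15 on disk 1, blocks 16-23 on disk 2, then back to disk 0:
print(stripe_map(0, 3))   # (0, 0)
print(stripe_map(8, 3))   # (1, 0)
print(stripe_map(24, 3))  # (0, 8)
```

Note that plain striping gives speed but no redundancy: lose one disk and the whole set is gone, which is why the mirroring and checksum variants exist.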
Note that Veritas is not available (yet) for Linux and it is not certain they can sell kernel modules without providing source for their proprietary code; this is just mentioned for information on what is out there. Still, you can check their web page http://www.veritas.com to see how such systems function.

1.5 Linux md Kernel Patch

There is however one kernel project that attempts to do some of this, md, which has been part of the kernel distributions since 1.3.69. Currently providing spanning and RAID, it is still in early development and people report varying degrees of success as well as total wipe-out. Use with caution.

1.6 General File System Considerations

In the Linux world ext2fs is well established as a general purpose system. Still, for some purposes others can be a better choice. News spools lend themselves to a log file based system, whereas high reliability data might need other formats. This is a hotly debated topic and there are currently few choices available, but work is underway. [I believe someone from Yggdrasil mentioned a log file based system once, details? And AFS is available for Linux I think, sources anyone?]

There is room for access control lists (ACL) and other unimplemented features in the existing ext2fs; stay tuned for future updates. There has been some talk about adding on-the-fly compression too. DouBle already features file compression with some limitations. Zlibc adds transparent on-the-fly decompression of files as they load. Also there is the user file system that allows an ftp based file system and some compression (arcfs) plus fast prototyping and many other features.

2 Disk Layout

With all this in mind we are now ready to embark on the layout [and no doubt controversy]. I have based this on my own method, used when I got hold of 3 old SCSI disks and boggled over the possibilities.
2.1 Selection

Determine your needs and set up a list of all the parts of the file system you want to be on separate partitions, sorted in descending order of speed requirement, together with how much space you want to give each partition. If you plan to RAID, make a note of the disks you want to use and which partitions you want to RAID. Remember that various RAID solutions offer different speeds and degrees of reliability. (Just to make it simple I'll assume we have a set of identical SCSI disks and no RAID.)

2.2 Mapping

Then we want to place the partitions onto physical disks. The point of the following algorithm is to maximise parallelizing and bus capacity. In this example the drives are A, B and C and the partitions are 987654321, where 9 is the partition with the highest speed requirement. Starting at one drive we 'meander' the partition line back and forth over the drives in this way:

        A : 9 4 3
        B : 8 5 2
        C : 7 6 1

This makes the 'sum of speed requirements' the most equal across the drives.

2.3 Optimizing

After this there are usually a few partitions that have to be 'shuffled' over the drives, either to make them fit or because of special considerations regarding speed, reliability, special file systems etc. Nevertheless this gives [what this author believes is] a good starting point for the complete setup of the drives and the partitions. In the end it is actual use that will determine the real needs, after we have made so many assumptions. After commencing operations one should assume a time comes when a repartitioning will be beneficial.

3 Further Information

There is a wealth of information one should go through when setting up a major system, for instance for a news or general Internet service provider. The FAQs in the following groups are useful:

News groups: comp.arch.storage, comp.sys.ibm.pc.hardware.storage, alt.filesystems.afs, comp.periphs.scsi ...

Mailing lists: raid, scsi ...
Many mailing lists are at vger.rutgers.edu, but this is notoriously overloaded, so try to find a mirror. There are some lists mirrored at http://www.redhat.com. [More references please!] [Much more info needed here.]

4 Concluding Remarks

Disk tuning and partition decisions are difficult to make, and there are no hard rules here. Nevertheless it is a good idea to work more on this as the payoffs can be considerable. Maximizing usage on one drive only while the others are idle is unlikely to be optimal; watch the drive lights, they are not there just for decoration. For a properly set up system the lights should look like Christmas in a disco.

Linux offers software RAID but also support for some hardware based SCSI RAID controllers. Check what is available. As your system and experience evolve you are likely to repartition, and you might look at this document again. Additions are always welcome.
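As a parting sketch, the 'meander' mapping from section 2.2 can be written down as a short Python function. The function name and the dictionary representation are my own; the algorithm is just the boustrophedon dealing described in that section, which keeps the per-drive sum of speed requirements roughly equal.

```python
# Sketch of the section 2.2 'meander' mapping: partitions are listed
# fastest-first and dealt across the drives, reversing direction on
# each pass. Names here are illustrative, not from the HOWTO itself.

def meander(partitions, drives):
    """partitions: list sorted fastest-first; drives: list of drive names."""
    layout = {d: [] for d in drives}
    forward = True
    for i in range(0, len(partitions), len(drives)):
        row = partitions[i:i + len(drives)]
        order = drives if forward else list(reversed(drives))
        for drive, part in zip(order, row):
            layout[drive].append(part)
        forward = not forward          # reverse direction each pass
    return layout

# Nine partitions, 9 = fastest, over drives A, B and C:
print(meander([9, 8, 7, 6, 5, 4, 3, 2, 1], ["A", "B", "C"]))
# {'A': [9, 4, 3], 'B': [8, 5, 2], 'C': [7, 6, 1]}
```

The printed layout matches the table in section 2.2; from here the shuffling of section 2.3 is done by hand.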