A Comparison of Linux's Journaled Filesystems
[Taken from an e-mail by Rick Moen to a consulting client, around May 2005.]
Disclaimer: Choice of filesystem in Linux is a controversial and debatable matter: Each of the major choices has proponents. Moreover, the facts have changed somewhat over time. It's been a bit difficult to distinguish reports based on recent code from those based on earlier code. Also, because distribution vendors sometimes apply their own idiosyncratic patches, Red Hat's ext3 as of a certain date (for example) may be somewhat different from that of other distributions with different kernels. The discussion below will aim to cover ext3/ext2, ReiserFS, XFS, and (briefly) IBM's JFS.
Comments below try to address recent ext3 versions, ReiserFS4, and XFS 1.1, setting aside impressions left by earlier versions that are no longer relevant. What follow are very general characteristics, which obviously won't apply (or be relevant) in all situations.
You will always have to go through a minimal fsck at boot time even with a journaled filesystem, but it's so quick you will barely see it — which is one of the advantages they have. The extent of journaling (metadata-only or metadata + data) can generally be set via mount options — with the exception that Reiser4 apparently cannot be set for metadata-only.
ext2 is not, of course, a journaled filesystem. However, it's notable because it's about the highest-performance filesystem in general use on any OS. Many admins elect to have portions of the file tree that are non-critical (e.g., /tmp) or that are to be normally mounted read-only (e.g., /usr) be formatted as ext2 for performance reasons.
In its general characteristics, ext2 is very much like FreeBSD's FFS with the "-o async" mount option (which explains why it's prone to occasional lossage in crashes). I still tend to use it for performance reasons on /tmp & similar, and on filesystems normally mounted read-only (e.g., /usr).
ext3 is a modified ext2 with integrated journaling ("logging" in Solaris lingo).
ext3 has good performance generally, and some of the fastest I/O speed for reads. It has the unique advantage that, if the journal is ever damaged or questionable, you can remount the filesystem as ext2 and recreate the journal. I.e., it has a "fallback" to non-journaled mode for recovery.
ext3 has by far the most mature, conservative, effective fsck and mkfs utilities (maintained by Ted Ts'o), and general operation. It is designed to be forgiving of the failure modes of commodity x86 hardware, which can be dangerous to data.
Alternative write modes, selected as mount options (summarised from an IBM article on Linux advanced filesystem management):
data=writeback: Metadata-only, and best performance. Could allow recently modified files to become corrupted in the event of an unexpected reboot.
data=ordered: Officially journals only metadata, but it logically groups metadata and data blocks into a single unit called a transaction. When it's time to write the new metadata out to disk, the associated data blocks are written first. data=ordered effectively solves data=writeback's corruption risk (shared by other journaled FSes generally), without resorting to full journaling. data=ordered tends to perform slightly slower than data=writeback, but significantly faster than full journaling.
When appending data to files, data=ordered provides all of full data journaling's integrity protection. However, if part of a file is being overwritten when the system crashes, it's possible the region being written will receive a combination of original blocks interspersed with updated blocks. This is because data=ordered provides no guarantees as to which blocks are overwritten first, so you can't assume that just because overwritten block x was updated, that overwritten block x-1 was updated as well. Data=ordered leaves the write ordering up to the hard drive's write cache. In general, this doesn't bite people often; file appends are much more common than overwrites. So, data=ordered is a good higher-performance replacement for full journaling.
data=journal: Full data and metadata journaling: All new data are written to the journal first. Oddly, in certain situations, data=journal can be blazingly fast, where data needs to be read from and written to disk at the same time (interactive I/O).
Reports say that data=journal provides highest throughput for mail spools, because the small files are written, sent, then removed. When this is all done in the journal, things supposedly go much faster. Otherwise, data=writeback tends to provide best overall performance; note that the default is data=ordered, not data=writeback. Directory indexing supposedly improves access for directories with many small files significantly.
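The data=ordered overwrite caveat described above can be illustrated with a toy model. This is only a sketch, not how ext3 is implemented: the function name, the block labels, and the flush order are all made up for illustration. It treats a file as a list of blocks and a crash as stopping the overwrite after some number of block writes have reached the disk.

```python
# Toy model only: a "file" is a list of blocks, and a crash stops the
# overwrite after some number of block writes have reached the disk.

def overwrite_with_crash(old_blocks, new_blocks, write_order, writes_before_crash):
    """Return the on-disk blocks after a crash part-way through an overwrite.

    write_order is the order in which blocks happen to be flushed; for
    in-place overwrites, data=ordered does not constrain that order.
    """
    on_disk = list(old_blocks)
    for index in write_order[:writes_before_crash]:
        on_disk[index] = new_blocks[index]
    return on_disk


old = ["old0", "old1", "old2", "old3"]
new = ["new0", "new1", "new2", "new3"]

# Hypothetical flush order chosen by the drive's write cache.
order = [2, 0, 3, 1]

# Crash after two of the four block writes have landed.
result = overwrite_with_crash(old, new, order, 2)
print(result)
```

In this run block 2 has been updated while block 1 has not, which is the "block x was updated, block x-1 was not" situation the text warns about.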
ext3 defaults to syncing all its buffers every 5 seconds, out of conservatism. This, too, can be increased for higher performance (mount parameter "commit=nnnn").
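As a rough sketch of how the journaling mode and the commit interval combine into one option string, the helper below builds the text you would pass to "mount -o" or place in the options column of /etc/fstab. The function names, the device, and the mount point are hypothetical; only the "data=" and "commit=" option names come from the discussion above.

```python
VALID_DATA_MODES = ("writeback", "ordered", "journal")


def ext3_mount_options(data_mode, commit_seconds=None, extra=()):
    """Build a comma-separated ext3 option string (illustrative helper)."""
    if data_mode not in VALID_DATA_MODES:
        raise ValueError(f"unknown ext3 data mode: {data_mode!r}")
    options = [f"data={data_mode}"]
    if commit_seconds is not None:
        # This sketch insists on an explicit positive interval in seconds.
        if commit_seconds <= 0:
            raise ValueError("commit interval must be a positive number of seconds")
        options.append(f"commit={int(commit_seconds)}")
    options.extend(extra)
    return ",".join(options)


def fstab_line(device, mount_point, options):
    """Format one /etc/fstab entry for an ext3 filesystem."""
    return f"{device} {mount_point} ext3 {options} 0 2"


# Hypothetical mail-spool partition: full journaling, 30-second commit.
opts = ext3_mount_options("journal", commit_seconds=30)
print(fstab_line("/dev/sdb1", "/var/spool/mail", opts))
```

A longer commit interval trades a larger window of possibly lost recent writes for fewer journal flushes, so the value is a judgment call for each filesystem.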
Linux 2.6.0 and later kernels + related utilities (or patched 2.4 kernels such as Red Hat's) also add directory-hashing support, which dramatically speeds performance on directories with thousands of files or more. Do "tune2fs -O dir_index /dev/[device]; e2fsck -D /dev/[device]" on each filesystem of interest, to enable directory hashing.
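If several filesystems need the same treatment, the two commands quoted above can be generated per device. The sketch below only builds and prints the command lines; it does not run them. The function name and the "/dev/sdXN" placeholder are assumptions for illustration, and e2fsck should only be run against a filesystem that is not mounted.

```python
import shlex


def dir_index_commands(device):
    """Return the two commands from the text as argument lists, without running them."""
    return [
        ["tune2fs", "-O", "dir_index", device],  # turn on the dir_index feature
        ["e2fsck", "-D", device],                # optimise existing directories
    ]


# Placeholder device name; substitute a real, unmounted ext3 partition.
for argv in dir_index_commands("/dev/sdXN"):
    print(shlex.join(argv))
```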
ReiserFS has gone through (at last count) four distinct, on-disk formats, with at best rocky compatibility from one to the next. The "fsck" (filesystem check) utilities for ReiserFS have earned a reputation for often repairing filesystems by massive deletion of files. This appears to happen primarily because of loss of metadata, as opposed to damage to datafiles. Many observers have been leery of the design, for those two reasons. Some would object that the characteristics of versions before ReiserFS4 are no longer relevant: Others hold the inconsistent, changing design, and severe reliability problems of the prior code against it.
ReiserFS enjoys fast file creation/deletion. It's best used for filesystems housing large numbers of small, changeable files, e.g., a machine running a Usenet news server. Reiser is space-efficient and does not pre-allocate inodes: They are allocated on the fly.
ReiserFS defaults to writing metadata before data. ext3-like behaviour can be forced, instead, by using a "data=ordered" mount parameter.
SGI's XFS is generally the fastest of the journaled filesystems, having exceptionally good performance for filesystems housing (individually) very large files (gigabytes each) on very large partitions, e.g., for video production: XFS was designed (on SGI IRIX) to be a full 64-bit filesystem from the beginning, and thus natively supports files as large as 2^63 bytes, i.e., about 9 x 10^18 bytes or about 9 exabytes (roughly nine million terabytes) as implemented on 2.6 kernels, or 64 terabytes as implemented on 2.4. XFS is much faster than the other filesystems at deleting large files: an order of magnitude faster than ext3 and two orders of magnitude faster than ReiserFS.
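The 2^63 size figure above can be checked with a few lines of arithmetic. This sketch uses decimal units (1 exabyte = 10^18 bytes, 1 terabyte = 10^12 bytes) and also shows the binary-unit equivalent; the variable names are just for illustration.

```python
max_bytes = 2 ** 63

exabytes = max_bytes / 10 ** 18    # decimal exabytes
terabytes = max_bytes / 10 ** 12   # decimal terabytes
exbibytes = max_bytes // 2 ** 60   # binary units (EiB)

print(f"{max_bytes} bytes")
print(f"{exabytes:.2f} EB, {terabytes:,.0f} TB, {exbibytes} EiB")
```

So 2^63 bytes is a little over 9 decimal exabytes, which is exactly 8 EiB in binary units.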
When last I used it, XFS performance on small and medium-sized files tended to be relatively a little slower than ReiserFS and a bit faster than ext3, but it's possible that this may have changed.
XFS defaults to writing metadata before data, and this behaviour cannot be overridden.
The biggest concern about XFS is the very extensive set of changes SGI had to make to the kernel's VFS layer in order to incorporate it.
Like most people, I have no experience with IBM's JFS for Linux (which IBM ported from OS/2, rather than from AIX). However, a friend who's used it extensively on Linux sends the following report:
JFS is generally reliable, but lost/damaged files show up in lost+found more often than they do with XFS. On the other hand, such files are more likely to be intact: XFS tends to pad them with null sections, which you must remove. JFS has a somewhat higher CPU cost than does XFS.
Both ReiserFS and XFS impose significant additional CPU load, relative to ext3 (except that XFS has very low CPU load, relatively speaking, when handling very large files).
There are many other subtleties of filesystem performance, such as whether performance is good with files that are extended in place as opposed to being erased and rewritten. If those are of interest, I can attempt to research them. (I've not included those for lack of data, at the moment.)
Note that performance can be strongly affected by other filesystem mounting options not covered here, e.g., admins will sometimes use the "noatime" option to gain higher performance on certain portions of the file tree, where it's not necessary to keep a timestamp of when each file was last accessed.
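To see which mounted filesystems are not yet using "noatime", one approach is to scan mount-table text in the format used by /proc/mounts (device, mount point, type, options, and two numeric fields). The sketch below works on a made-up sample string, so the devices and mount points are purely illustrative, and the function name is an assumption.

```python
# Made-up sample in /proc/mounts layout; not taken from a real system.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext3 rw,data=ordered 0 0
/dev/sda2 /var/spool/news reiserfs rw,noatime 0 0
/dev/sda3 /tmp ext2 rw 0 0
"""


def mounts_without_noatime(mounts_text):
    """Return the mount points whose option list lacks 'noatime'."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3].split(",")
        if "noatime" not in options:
            result.append(mount_point)
    return result


print(mounts_without_noatime(SAMPLE_MOUNTS))
```

On a live Linux system the same function could be given the contents of /proc/mounts instead of the sample string.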