reiserfs

[Archivist's note: Although this file's title and much of the postings by ext2/ext3 designer/coder Ted Ts'o (quoted below) specifically refer to ReiserFS, the substance of his critique applies equally to SGI's XFS. Accordingly, people favouring ReiserFS or XFS on commodity PC hardware should think long and hard before using either with data they care about, unless they have uninterruptible power supplies. (It's unknown whether IBM's JFS poses the same risks, but I think it very likely.)]

Date: Sun, 19 Dec 2004 22:54:02 -0500
From: "Theodore Ts'o" <tytso@mit.edu>
To: [a mailing list]
User-Agent: Mutt/1.5.6+20040907i
Subject: Re: [evals] ext3 vs reiser with quotas

On Thu, Dec 16, 2004 at 02:50:45PM -0800, Rick Moen wrote:
> Quoting Nick Moffitt (nick@zork.net):
>
> > Tell me, is fsck.reiser still equivalent to mkfs.reiser?
>
> Pretty much. It's the closest thing to filesystem Lotto.

For a very good time, create a few dozen files containing images of ReiserFS filesystems on a ReiserFS (scratch) filesystem, and force an fsck.reiser. All of ReiserFS is a single B-tree that can be anywhere on disk. So what fsck.reiser does is to search the entire disk for blocks that look vaguely like parts of the filesystem B-tree, and stitch them all together. Whee!!!!
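
[Illustration, not part of the original posting: a minimal sketch of why a "scan the whole disk for things that look like tree nodes" repair can be fooled by filesystem images stored in ordinary files. The MAGIC signature, the block layout, and the function name are all hypothetical; the real on-disk format and the real repair tool are far more involved.]

```python
MAGIC = b"NODE"  # hypothetical node signature, NOT the real on-disk format


def scan_for_nodes(disk_blocks):
    """Return the index of every block that merely looks like a tree node."""
    return [i for i, blk in enumerate(disk_blocks) if blk.startswith(MAGIC)]


# Genuine tree nodes belonging to the outer (scratch) filesystem.
outer_nodes = {0: MAGIC + b" outer root", 1: MAGIC + b" outer leaf"}

# Data blocks of an ordinary file whose contents happen to be an image of
# another filesystem, so they carry the same signature.
image_file_data = {5: MAGIC + b" inner root", 6: MAGIC + b" inner leaf"}

disk = [b"free"] * 8
for index, block in {**outer_nodes, **image_file_data}.items():
    disk[index] = block

found = scan_for_nodes(disk)
false_positives = [i for i in found if i not in outer_nodes]

print("blocks that look like nodes:", found)
print("of which are really file data:", false_positives)
```

In this toy model the scan cannot tell the inner image's nodes from the outer filesystem's own, which is the situation the paragraph above describes.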

- Ted

Date: Sun, 19 Dec 2004 23:10:09 -0500
From: "Theodore Ts'o" <tytso@mit.edu>
To: [a mailing list]
User-Agent: Mutt/1.5.6+20040907i
Subject: Re: [evals] ext3 vs reiser with quotas

On Wed, Dec 15, 2004 at 01:31:13PM -0800, Sean Perry wrote:
> Net Llama! wrote:
> >I really have to ask what you were running when you had that FS
> >corruption. I've never seen anything like that, and i've been using XFS
> >for about 3 years now on over 15 different boxes.
> >
>
> Simple as pie. Set up a machine running XFS. Start some disk activity
> (maybe a test mail server). When the FS is getting solid activity, yank
> the power out. When I saw this happen I had to mkfs and start over.

What probably hit you here is caused by the very simple fact that PC-class hardware is crap.

You see, when you yank the power cord out of the wall, not all parts of the computer stop functioning at the same time. As the voltage starts dropping on the +5 and +12 volt rails, certain parts of the system may last longer than other parts. For example, the DMA controller, hard drive controller, and hard drive unit may continue functioning for several hundred milliseconds, long after the DIMMs, which are very voltage sensitive, have gone crazy, and are returning total random garbage. If this happens while the filesystem is writing critical sections of the filesystem metadata, well, you get to visit the fun Web pages at http://You.Lose.Hard/ .

I was actually told about this by an XFS engineer, who discovered this about the hardware. Their solution was to add a power-fail interrupt and bigger capacitors in the power supplies in SGI hardware; and, in Irix, when the power-fail interrupt triggers, the first thing the OS does is to run around frantically aborting I/O transfers to the disk. Unfortunately, PC-class hardware doesn't have power-fail interrupts. Remember, PC-class hardware is cr*p.

Why doesn't ext3 get hit by this? Well, because ext3 does physical-block journaling. This means that we write the entire physical block to the journal, and only when the updates to the journal are committed do we write the data to the final location on disk. So, if you yank out the power cord, and inode tables get trashed, they will get restored when the journal gets replayed.
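
[Illustration, not part of the original posting: a toy model of physical-block journaling as described above. The class and method names are hypothetical, the "disk" is a Python list, and the sketch assumes the journal itself survives the power failure intact.]

```python
BLOCK_SIZE = 16


class PhysicalJournalDisk:
    """Toy disk whose journal stores complete images of modified blocks."""

    def __init__(self, nblocks):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(nblocks)]
        self.journal = []  # committed (block_number, full_block_image) records

    def commit(self, block_no, image):
        """Log a full copy of the block before it is written in place."""
        assert len(image) == BLOCK_SIZE
        self.journal.append((block_no, bytes(image)))

    def scribble(self, block_no, garbage):
        """Simulate dying hardware writing garbage over the final location."""
        self.blocks[block_no] = bytes(garbage)

    def replay(self):
        """Crash recovery: rewrite every journaled block image, in log order."""
        for block_no, image in self.journal:
            self.blocks[block_no] = image


disk = PhysicalJournalDisk(nblocks=4)
inode_table = b"inode-table-v2!!"  # exactly BLOCK_SIZE bytes

disk.commit(2, inode_table)             # journal commit completes
disk.scribble(2, b"\xff" * BLOCK_SIZE)  # power fails during the in-place write
disk.replay()                           # next mount replays the journal

recovered = disk.blocks[2]
print(recovered == inode_table)
```

Because the journal holds the whole block, replay overwrites whatever garbage landed at the final location.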

XFS and ReiserFS do what is called logical journaling. This means that if you modify the mod time of an inode, instead of writing the entire inode table block to the journal, they just write a note in the journal stating that "inode X now has a mod time of Y". This takes much less space, so you can pack many more journal updates into a single disk block. For filesystem benchmarks that try to saturate the filesystem's write bandwidth, and/or that also have very high levels of metadata updates, XFS and ReiserFS will tend to do much better than does ext3. Fortunately, many real-world workloads don't have this characteristic, which is why ext3 tends to perform just fine in practice in many applications, despite what would appear to be much worse benchmark numbers, at least for some benchmarks.

The problem with logical journaling is that if you do have an unexpected power drop, and the storage system scribbles garbage on the inode table, there's no way to recover. In some cases, the only thing that is left to be done is mkfs and restore from backups. (Backups? You do keep backups, right?)
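
[Illustration, not part of the original posting: a toy model of replaying a logical journal record on top of a block that was scribbled on. The record format (offset, new bytes), the block contents, and the function name are hypothetical and stand in for "inode X now has a mod time of Y".]

```python
BLOCK_SIZE = 16


def replay_logical(block, records):
    """Apply (offset, new_bytes) deltas on top of whatever the block holds."""
    buf = bytearray(block)
    for offset, new_bytes in records:
        buf[offset:offset + len(new_bytes)] = new_bytes
    return bytes(buf)


good_block = b"ino7 mtime=0000 "  # exactly BLOCK_SIZE bytes
records = [(11, b"1234")]         # "inode 7 now has a mod time of 1234"

# Normal crash: the block on disk is stale but intact, so replay fixes it.
clean = replay_logical(good_block, records)

# Power-drop case: the block was overwritten with garbage before replay.
scribbled = b"\xff" * BLOCK_SIZE
damaged = replay_logical(scribbled, records)

print(clean)
print(damaged)
```

Replay restores only the four bytes the record mentions; the rest of the scribbled block stays garbage, since the journal never held a copy of it.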

The practical upshot of this is that if you use XFS or ReiserFS, and you are on crappy PC-class hardware, you ***MUST*** have a UPS, and use a serial/USB cable, so that the system can do a graceful shutdown when the UPS's batteries are exhausted.

This issue is completely different from the XFS issue of zeroing all open files on an unclean shutdown, of course. Having a UPS won't save you from that particular XFS misfeature. The reason why it is done is to avoid a potential security problem, where a file could be left with someone else's data. Ext3 solves this problem by delaying the journal commit until the data blocks are written, as opposed to trashing all open files. Again, it's a solution that can impact performance, but at least in my opinion, for a filesystem, performance is Job #2. Making sure you don't lose data is Job #1.
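
[Illustration, not part of the original posting: a toy model of why writing data blocks before committing the metadata avoids exposing someone else's old data after a crash. The two-step "write", the dictionary standing in for a disk, and all names are hypothetical simplifications.]

```python
def fresh_disk():
    # The data block still holds bytes left behind by a previous owner.
    return {"data": "someone else's old data", "meta": None}


def write_file(disk, crash_after, ordered=True):
    """Run a two-step file update, stopping after `crash_after` steps."""
    steps = [("data", "new contents"), ("meta", "points-to-data")]
    if not ordered:
        steps.reverse()  # metadata committed before the data reaches disk
    for key, value in steps[:crash_after]:
        disk[key] = value
    return disk


def exposes_stale_data(disk):
    """True if metadata claims the block while it still holds old contents."""
    return disk["meta"] is not None and disk["data"] != "new contents"


# Crash after 0, 1, or 2 completed steps, under each ordering.
ordered_results = [
    exposes_stale_data(write_file(fresh_disk(), n, ordered=True))
    for n in range(3)
]
unordered_results = [
    exposes_stale_data(write_file(fresh_disk(), n, ordered=False))
    for n in range(3)
]

print("data first:    ", ordered_results)
print("metadata first:", unordered_results)
```

With the data written first, no crash point leaves the metadata pointing at the old contents; with the metadata first, a crash between the two steps does.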

> Install Debian on a laptop using XFS. Was attempting to use ACPI for
> battery stats, sleeping, etc. Turned out ACPI crashes this thinkpad. Was
> doing an apt-get install on a large package list and the box wedged,
> hard. Turn it back on and there was nothing left of /var. Just rubble.

Your other mistake was trying to use ACPI. :-)

- Ted