[conspire] annoying ext4 disaster under Kademar

Mon Oct 31 15:41:10 PDT 2011

Quoting Bruce Coston (jane_ikari at yahoo.com):

> Does anyone have field experience with ext4 filesystems failing after
> trying to write too many files onto them?

While we're waiting for you to detail what symptoms you're dealing
with:

Any filesystem can get somewhat munched by filesystem software bugs,
misbehaving userspace software, sudden power failures, etc.  Your
basic tool for recovery is one of a class of tools called 'fsck', which
stands for 'filesystem check' and includes repair.  (I'm covering this
topic broadly so that everyone including novices might benefit.)

Your symptom might be that Kademar is unable to boot from the root
filesystem, and halts with an error indicating filesystem problems, or
it might be that booting halts and puts you into single-user mode
because a non-root filesystem could not be mounted.

If it's a non-root filesystem, boot into single-user mode to do the fsck
operation.  If it's the root filesystem, haul out your live CD or
flash-drive maintenance media, and boot that.  

As the root user, do:

/sbin/fdisk -l 

...to show the partition tables of all drives.  To look at just one
drive (e.g., /dev/sdb), specify its device name at the end of that
command.  If you have a GPT-type partition table, then use the
equivalent 'parted -l' (instead of /sbin/fdisk, which is limited to
old-style IBM/Microsoft-type partition tables).

If the partition table is scrambled, haul out your Parted Magic live CD,
boot that, and run the TestDisk utility to try to diagnose and fix it.

WARNING:  Any repair utility can, if it guesses wrong about what to do,
cause further damage.  If you want to protect against that scenario,
don't repair the damaged filesystem but rather first try repair against
a bitwise copy you create using dd.  Otherwise, onwards:

/sbin/fsck.ext4 -v /dev/XXX

(where XXX is your partition's device specifier, and '-v' is the verbose
option)
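
If you'd rather take the cautious route from the warning above, here is
a minimal sketch of making and verifying the bitwise copy.  The function
name 'image_partition' and the path /mnt/spare are my own placeholders,
not anything standard, and the destination needs at least as much free
space as the whole partition:

```shell
# image_partition SRC DEST
# Copy SRC to DEST block for block, then confirm the two are identical.
# (Hypothetical helper; SRC would normally be /dev/XXX, unmounted.)
image_partition() {
    src=$1
    dest=$2
    # dd reports its transfer statistics on stderr; cmp is silent on success.
    dd if="$src" of="$dest" bs=64k && cmp "$src" "$dest"
}

# Example use, as root (paths are placeholders):
#   image_partition /dev/XXX /mnt/spare/XXX.img
#   /sbin/fsck.ext4 -v /mnt/spare/XXX.img
```

fsck.ext4 is happy to operate on an image file in place of a device, so
you can see what a repair would do before touching the real thing.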

The fsck utility might balk if the superblock is bad.  If that happens,
don't worry; that's why there are multiple copies of the superblock
stored for each filesystem.  Find them:

/sbin/mke2fs -n /dev/XXX

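Yes, 'mke2fs' is correct, even though this is ext4, and the '-n' is
vital:  It makes mke2fs merely report what it would do, without
actually creating a filesystem.  You'll get a list of the starting
block numbers for _additional_ copies of the superblock.
Try fsck again with one of the starting block numbers in place of NNNNN: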

/sbin/fsck.ext4  -b NNNNN   -v  /dev/XXX
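
Picking candidate values for NNNNN out of the mke2fs output by eye works
fine, but as a rough sketch, they can also be extracted mechanically.
This assumes the list follows a line reading 'Superblock backups stored
on blocks:', which is how the e2fsprogs versions I've seen print it; the
function name is made up for illustration:

```shell
# list_backup_superblocks
# Read 'mke2fs -n' output on stdin; print one backup superblock
# number per line.  (Hypothetical helper; assumes the heading text below.)
list_backup_superblocks() {
    awk '
        /Superblock backups stored on blocks:/ { grab = 1; next }
        # Indented, comma-separated numbers follow the heading.
        grab && /^[ \t]*[0-9]/ { gsub(/[^0-9]+/, "\n"); print; next }
        # Anything else ends the list.
        grab { grab = 0 }
    ' | sed '/^$/d'
}

# Example use, as root:
#   /sbin/mke2fs -n /dev/XXX | list_backup_superblocks
```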

Let fsck run to completion, and note what it says.  If the filesystem
comes out clean, reboot and go back into production operation.  If not,
try doing fsck using a different superblock copy.
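
Besides its on-screen report, fsck.ext4 also returns an exit status that
is a sum of flag values (documented in the e2fsck man page).  Here's a
small sketch that spells the status out in words; 'explain_fsck_status'
is just a name I picked:

```shell
# explain_fsck_status STATUS
# Print one line per condition encoded in an e2fsck exit status.
explain_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    # The status is a bitmask, so several conditions can apply at once.
    [ $((status & 1)) -ne 0 ]   && echo "filesystem errors corrected"
    [ $((status & 2)) -ne 0 ]   && echo "system should be rebooted"
    [ $((status & 4)) -ne 0 ]   && echo "filesystem errors left uncorrected"
    [ $((status & 8)) -ne 0 ]   && echo "operational error"
    [ $((status & 16)) -ne 0 ]  && echo "usage or syntax error"
    [ $((status & 32)) -ne 0 ]  && echo "canceled by user request"
    [ $((status & 128)) -ne 0 ] && echo "shared-library error"
    return 0
}

# Example use, as root:
#   /sbin/fsck.ext4 -v /dev/XXX; explain_fsck_status $?
```

A status with the 4 bit set is your cue to try the next superblock copy.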

If nothing works, congratulate yourself for your foresight in having
created good backups.

And, by the way, if your Kademar kernel is earlier than 2.6.32-rc6, 
then the ext4 filesystem code suffers a nasty data-loss-causing bug, and
you should upgrade it.
http://www.phoronix.com/scan.php?page=news_item&px=NzY2OQ
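
To check which kernel you're running, 'uname -r' prints the release
string.  As a rough sketch (the function name is mine), the comparison
can be automated.  Note the limitation:  This looks only at the numeric
part of the version, so 2.6.32-rc1 through -rc5 count as plain 2.6.32
here, and you'd need to eyeball those cases yourself.

```shell
# kernel_older_than HAVE WANT
# Succeed (exit 0) if version HAVE is numerically older than WANT.
# Anything after the first '-' (e.g. '-rc6', '-generic') is ignored.
kernel_older_than() {
    awk -v have="${1%%-*}" -v want="${2%%-*}" 'BEGIN {
        split(have, h, "."); split(want, w, ".")
        # Compare major, minor, and patch level in turn.
        for (i = 1; i <= 3; i++) {
            if (h[i] + 0 < w[i] + 0) exit 0
            if (h[i] + 0 > w[i] + 0) exit 1
        }
        exit 1    # equal, hence not older
    }'
}

# Example use:
#   kernel_older_than "$(uname -r)" 2.6.32 && echo "upgrade your kernel"
```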