[conspire] The practice of making ext4 a default needs to die an excruciating and gruesome death

Daniel Gimpelevich daniel at gimpelevich.san-francisco.ca.us
Thu Aug 9 22:05:56 PDT 2012


Many here remember that for years I sang the praises of IBM's JFS. I
stopped doing that after it pissed me off one too many times. With JFS,
even the slightest unclean unmount means the file system will absolutely
refuse to be mounted ever again until you fsck it. If that file system
is connected by means of a flaky USB cable, that means one might need to
repeat that hassle every couple of minutes. That's pretty frustrating, and so
is having to do that to the root file system after any unclean shutdown.
Every so often, that fsck would also automagically delete a random file
or two. Additionally, legacy GRUB could not read it beyond a certain
version, so one needed a /boot partition.

So, after enough of that, I looked into the state of btrfs and decided
not to take the plunge, especially after reading that existing ext4 file
systems could be converted in place to btrfs at any time by recreating
the metadata within the free space and not touching the data itself. I
decided to tar up my entire hard disk onto an external (still formatted
with JFS, I might add) and mkfs.ext4 my JFS root, untarring things right
back to how they were. I replaced my old /boot with an image of
SystemRescueCD. This all succeeded just fine. I no longer had to deal
with mount not working, and fsck would even run itself without my
invoking it, just from my mounting an uncleanly unmounted file system.
Yay! (Not.) Said fsck would constantly notify me that it was clearing a
bunch of "orphaned inodes", i.e., it was wantonly deleting my files. :(

Some time earlier, I had noticed that after I had upgraded from Lucid to
Maverick, I began to experience random hardware freeze-ups. At this
point, I am hoping it might be because my laptop battery needs to be
replaced, and I hope they stop after I replace it. If they do not, I
will reapply the Arctic Silver 5 to the CPU in hopes THAT will fix it.
If that still won't do it, I will return to the assumption that it is a
software-caused issue.

I use Tomboy Notes for a lot of handy things and I consider its
functionality essential, but I have no love for the piece of software
itself and the abhorrent Mono bloat it represents. I continue to use it,
having not bothered to investigate saner replacements.

One day, I was updating a simple financial ledger I was keeping in part
of one of the notes, and one of the aforementioned freeze-ups happened
as I was typing. Horrifyingly, the hard disk access light was lit and
not even flickering anymore. After the panic subsided enough
for me to force reboot, I opened Tomboy Notes again, and that note was
no longer in the list. I dug into the directory where the notes are kept
(~/.local/share/tomboy) and found the note files themselves. They are
stored as XML files with UTF-8 header bytes. There was no file
corresponding to the missing note. Nothing else seemed to be gone. I
rebooted into SystemRescueCD and frantically ran several passes of
PhotoRec with various permutations of options. All it could turn up that
was of any use was a several-months-old version of that Tomboy note,
with too much newer data simply lost. I decided to grep the entire file
system itself for certain data I knew I had only in that note. Again,
all that could be found was that outdated version of the note.
SystemRescueCD kept running out of memory doing these things, so I
stupidly enabled the swap partition. Major facepalm when I then realized
that having failed to find the data anywhere else, I should grep THAT. I
turned off swap and did so. What seemed like a lifetime of panic came to
an end when one of the matches in swap turned out to be the missing
note, but in plain text and not XML for some reason. It was not the most
current version either, but it was recent enough that I could
reconstruct the remaining missing data just from memory. Whew!
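
For anyone who ends up in the same spot, here is a minimal Python sketch
of that last-ditch search: scanning a raw partition (or an image of one)
for a byte string known to exist only in the lost file. It is not the
tool I used (that was plain grep); the function name and the device path
in the usage note are made up for illustration. It reads in fixed-size
chunks so it does not run out of memory, and carries a small overlap
between chunks so a match straddling a chunk boundary is not missed.

```python
def find_pattern(path, pattern, chunk_size=1 << 20):
    """Return the byte offsets of every occurrence of `pattern` in `path`.

    `path` may be a regular file or a block device opened read-only.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    offsets = []
    overlap = len(pattern) - 1   # bytes carried over between chunks
    tail = b""                   # end of the previous buffer
    pos = 0                      # file offset of the first byte of `tail`
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf = tail + chunk
            start = 0
            while True:
                i = buf.find(pattern, start)
                if i < 0:
                    break
                offsets.append(pos + i)
                start = i + 1
            # Keep only the last `overlap` bytes; a full match cannot fit
            # inside them, so nothing is reported twice.
            tail = buf[-overlap:] if overlap else b""
            pos += len(buf) - len(tail)
    return offsets
```

Hypothetical usage, as root and with swap turned off first:
find_pattern("/dev/sdXN", b"some phrase from the ledger"), then seek to
each offset and dump the surrounding bytes. Text stored compressed or in
another encoding (UTF-16, for instance) will not match.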

In the past, Rick had explained on this mailing list several reasons why
WUBI is bad. While attending SCaLE9x in 2011, I made a WUBI install of
Maverick Meerkat onto a certain person's Dell laptop that came with
Vista. With WUBI, a file is created in Windows's NTFS to be mounted on a
loopback device as the root file system in Ubuntu. A swap file is then
made in that root file system for virtual memory. Eek! This means that
for the kernel swapper to page out, it needs to access the file by
going through the file system it's on, in turn by going through the
loopback device to the file where THAT lives, in turn by going through
the file system THAT lives on, which is NTFS, which is accessed by means
of the ntfs-3g _user space_ driver via FUSE. This of course means that
an unclean unmount of said NTFS is a potential monkey wrench in the
works. The battery in that Dell completely failed less than a year after
that laptop was purchased brand-new from Best Buy. Dell insisted that
their one-year warranty did not extend to the battery. The DC jack on
that laptop is VERY loose. Inevitably, one time when the plug came out
as the computer was running, Ubuntu would not boot up again when power
was reapplied. Being the nearest technical person, I took a look. The
unclean unmount of the NTFS forced a CHKDSK, which found something wrong
with the inode for the file holding the loopback file system. As a
corrective measure, without human intervention, it simply deleted that
file. The reason Ubuntu was no longer booting was that it was no longer
present on the computer. I did some searching for some NTFS undelete
utilities, and I don't remember what I found, but I undeleted the
loopback file, and Ubuntu lived again.

OK, now you've read the background information.

Some time ago, I got a call from Adrien saying he knew someone who was
having recurrent unspecified problems with Ubuntu after having migrated
from Windows. I spoke to this person, and he described a pattern of
having his systems compromised in a consistent way, and how this pattern
continued unabated after migrating away from Windows. I took a look at a
tower and a laptop. The tower had certain obvious issues stemming from
compounded user errors, but nothing blindingly obvious as an attack
vector. The laptop was peculiarly not starting X, which seemed to simply
die inexplicably when lightdm was supposed to start. After poking around
on the laptop console, I found that maybe half the dependencies of the
ubuntu-desktop metapackage were not considered by dpkg to be installed.
So, I did "apt-get install ubuntu-desktop" and after it was done, a
reboot brought the system to life, with all its old settings. It seemed
to work just fine at that point. I engaged in some application installs
on the tower, within the user account. Then I let things be. I came
back after a while, and the tower seemed to be missing much of what I put
there, and the laptop no longer had working audio. Looking at the
laptop, I found that huge swaths of /lib/modules were simply gone,
including all sound drivers. Puzzled, I did a "sudo dpkg -i" of the
current kernel, which was still in /var/cache/apt/archives, and sound
worked again. Firefox also seemed to not be installed on the tower. It
was "rc" in "dpkg -l" and a "sudo apt-get install ubuntu-desktop" put it
back. The Windows installation I did under VirtualBox seemed to be gone
altogether from the user's home directory. I did "locate -i virtualbox"
and it showed that it was still there. I did similar queries for other
things that were obviously missing, and they also showed as still there,
along with contents of hidden dotfile directories that "ls -a" showed no
trace of.
I know that Ubuntu updates the mlocate.db at 8am sharp every day if it
can. I asked whether the computer was on at 8am on that day. He said it
was, but he shut it down later in the morning. The modification
timestamp confirmed this. So, something had deleted large swaths of the
file systems of both the tower and the laptop within the preceding short
amount of time, when no one whatsoever had any physical access to either
machine until I was there, and the laptop also had no network
connection, with rfkill constantly in force, and Ethernet not plugged
in. He stated that this was an example of what keeps happening, and that
it used to happen under Windows as well, when it ran on bare metal. Of
course, given the background information above, I figured I should ask:
"When you shut down the computer, exactly how do you usually do it?" You
can take a wild guess what his answer was…
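
The locate trick generalizes: the database is a snapshot of what existed
when updatedb last ran, so anything it still lists that is no longer on
disk vanished after that point. A rough Python sketch of the comparison
follows. The function names are invented for illustration, and it
assumes a "locate" binary on the PATH that accepts -i, as used above.

```python
import os
import subprocess

def locate(pattern):
    """Return the paths the locate database lists for `pattern`."""
    # locate exits non-zero when nothing matches, so do not treat that
    # as an error; an empty result is simply an empty list.
    out = subprocess.run(["locate", "-i", pattern],
                         capture_output=True, text=True).stdout
    return [line for line in out.splitlines() if line]

def missing_paths(indexed_paths):
    """Return the indexed paths that no longer exist on disk."""
    # lexists() counts a dangling symlink as present, since the link
    # itself is still there.
    return [p for p in indexed_paths if not os.path.lexists(p)]
```

For example, missing_paths(locate("virtualbox")) lists what disappeared
since the last database update. Paths containing newlines would confuse
this simple line-based parsing.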

This also leads me to believe, without having actually looked into it,
that certain files disappearing may cause dpkg to consider a package to
be in the "removed" state. Based on what I have heretofore known of
dpkg's inner workings, I would have assumed otherwise, but this
situation says to me that maybe dpkg actually does work that way.
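
This does not settle how dpkg arrived at that state, but the state
itself is easy to inspect: the "rc" shown by "dpkg -l" corresponds to a
"Status: deinstall ok config-files" line in dpkg's status database,
normally /var/lib/dpkg/status. Below is a small Python sketch that
parses the text of that file and lists such packages. The function names
are my own, and it ignores architecture qualifiers for simplicity.

```python
def packages_by_state(status_text):
    """Map package name -> (want, flag, state) from dpkg status text."""
    result = {}
    # Stanzas are separated by blank lines; each has Package: and Status:.
    for stanza in status_text.split("\n\n"):
        name = status = None
        for line in stanza.splitlines():
            if line.startswith("Package:"):
                name = line.split(":", 1)[1].strip()
            elif line.startswith("Status:"):
                status = line.split(":", 1)[1].split()
        if name and status and len(status) == 3:
            # Simplification: a later stanza with the same name (another
            # architecture) overwrites an earlier one.
            result[name] = tuple(status)
    return result

def removed_but_configured(status_text):
    """Return sorted names of packages in the 'rc' state."""
    return sorted(
        name
        for name, (want, _flag, state) in packages_by_state(status_text).items()
        if want == "deinstall" and state == "config-files"
    )
```

Usage would be along the lines of
removed_but_configured(open("/var/lib/dpkg/status").read()), run on the
affected machine. If the status file itself was among the casualties of
an unclean shutdown, its contents are worth a look before trusting them.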

In summary, ext4 is Not Ready for Prime Time. Any ideas what to replace
it with, folks?





