[sf-lug] Hayes Valley Mystery

Wed Apr 16 22:49:57 PDT 2008

Hi Folks,

I thought this might be an interesting one to share with the list. It's
a bit of a shaggy dog story, so bear with me... 

A few days ago, Jim Stockford and I started getting alerts from the
Nagios instance set up on the sf-lug box telling us that there was a
problem with the server at the Hayes Valley project we (and other people
on this list) have been volunteering at. Fair enough. I was planning to
stop by there on Tuesday in any case, so figured I'd check it out then.

When I arrived, it turned out that the place had been re-painted and
during that process all the computers had been moved around, cables
unplugged, reconnected, etc. Basically a complete mess. This included
all of the workstations in the public area of the community center, but
more importantly, it also included the server in the server room. The
server is configured as the network gateway, providing DNS and DHCP as
well as "content filtering" through DansGuardian in a transparent
proxying setup. With this server down, the entire network was down, so
getting it running again became my first priority.

First of all I sorted out the cabling mess, and then booted the server.
The boot process didn't complete and I was dropped into a recovery
environment with limited commands available to me. I was able to see all
the drives on the server, mount and inspect each one, and verify that
everything seemed okay. Except that obviously everything wasn't okay.
The recovery console had been preceded by a message about the mdadm
devices (software RAID) not being correctly configured.

To make an extremely long story not quite so long, we were able to get
the server back up and running by booting into an older kernel (manually
applied updates had installed a new kernel in the 120+ days since the
server was last rebooted, and we thought it might be worth trying the
older kernel, which sure enough it was). 

So at this stage we had a server booting fine. Almost. We realised that
we would want to change the default kernel to be the older one so that
the server could reboot unattended. At that point, the default kernel
was still the newer one, which was having problems recognising the
software RAID devices and so couldn't boot correctly. So we thought
it would just be a simple matter of editing /boot/grub/menu.lst. Only
problem is, /boot was empty. How so? We happened to know that /boot
should be /dev/sda1, so we mounted that to the /boot folder, and then
edited the menu.lst file as above to use the correct kernel. We then
edited /etc/fstab, which sure enough had the entry for /dev/sda1 to be
mounted as /boot commented out. Simple case of uncommenting the entry
and rebooting, surely?
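
For anyone who wants to picture the two edits, here is a rough Python
sketch. It is an illustration only: we made both changes by hand in an
editor. The function names, the placeholder kernel names and the ext3
type in the sample fstab line are all assumptions made up for the
example, and it assumes a GRUB legacy style menu.lst where entries are
counted from 0.

```python
import re


def set_default_kernel(menu_lst: str, version: str) -> str:
    """Point the `default` line at the first entry whose title mentions `version`."""
    # Only uncommented `title` lines count as menu entries.
    titles = [line for line in menu_lst.splitlines() if re.match(r"\s*title\b", line)]
    for index, title in enumerate(titles):
        if version in title:
            break
    else:
        raise ValueError(f"no menu entry mentions {version!r}")
    # Rewrite the first uncommented `default` line with the entry's index.
    new_text, count = re.subn(
        r"(?m)^default[ \t=]+\S+", f"default {index}", menu_lst, count=1
    )
    if count == 0:
        raise ValueError("no 'default' line found")
    return new_text


def uncomment_fstab_entry(fstab: str, device: str, mount_point: str) -> str:
    """Strip the leading '#' from the fstab line for `device` on `mount_point`."""
    out = []
    for line in fstab.splitlines():
        stripped = line.lstrip("# \t")
        is_comment = line.lstrip().startswith("#")
        if is_comment and stripped.split()[:2] == [device, mount_point]:
            line = stripped
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical file contents; the real files on the server differ.
    menu = (
        "default 0\n"
        "title Linux, kernel-new\n"
        "title Linux, kernel-old\n"
    )
    fstab = "# /dev/sda1 /boot ext3 defaults 0 2\n"
    print(set_default_kernel(menu, "kernel-old"))
    print(uncomment_fstab_entry(fstab, "/dev/sda1", "/boot"))
```

Neither function touches the disk; each takes the file contents as a
string and returns the edited text, so the result can be looked over
before anything is written back.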

Except when we rebooted, it failed, reporting a superblock error with
/dev/sda1. All other filesystems were mounted, but not /boot. It
recommended something like running fsck against /dev/sda1, but using a
different superblock. Unfortunately I don't remember the exact error
message.

So the questions here are:
• How did the system boot from a device that it failed to mount (we know
it was booting from /dev/sda1 because the changes we'd made
to /boot/grub/menu.lst when we manually mounted /dev/sda1 before
rebooting were applied)?
• How can we mount a partition if it's failing to be mounted as part of
the boot sequence?
• What checks can we do on the filesystem to confirm it's all good?

We have some theories, but thought it was an interesting one to throw to
the list. Enjoy...

Cheers, Tom