[sf-lug] Red: How linuxmafia.com got back to being operational again :-)

Rick Moen rick at linuxmafia.com
Sun Jan 25 23:09:17 PST 2015


Sorry to break threading, but in this case I'm responding from the Web
archive.  Note Daniel's point about how Gmane.org can help mitigate
this effect.  Good point, Daniel (though I don't promise I'll ever post
from Gmane.org).

Michael Paoli wrote:

> At least the point at which I picked it up ... had gotten some more
> detailed information from Rick - mostly notably relevant parts of a
> session log from the updates/installs that were in progress then the
> hardware issue occurred - and in fact covering from an (apparently)
> known good state before the start of those updates/installs, through
> to apparent hardware failure, and also at least basic descriptions of
> observed host behavior after that and if I recall correctly, some bit
> of evaluation of status of disk data after that (filesystem integrity
> apparently okay, but not able to boot from disk).

Correct.  Just a comment about the session log:  I've long been in the
habit of doing any major system maintenance under GNU script, a utility
that opens a subshell and logs everything in it to disk, keystroke by
keystroke.  That's a pretty good idea.  However, in this one case, I
happened to be doing client-side (rather than server-side) logging of my
upgrade session, on the laptop from which I was ssh'ed in.  The fact
that the session log was _not_ being written to disk on the server
turned out to be serendipitous, a fancy word for 'dumb luck'.  ;->  [1]
It meant I had trivial access to the log even though the server was
down.
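
For anyone who hasn't tried GNU script, the basic routine is as simple
as this (the commands and filenames here are only illustrative):

    # Open a locally logged subshell, do the risky work, then exit:
    script -a /root/logs/upgrade-2015-01-17.log
    apt-get dist-upgrade
    exit

    # Client-side variant (what I happened to be doing): run script on
    # the laptop first, then ssh in, so the log lives off the server:
    script -a ~/logs/linuxmafia-upgrade.log
    ssh linuxmafia.com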

> ...apparently due to line power issue and apparently leading to failed
> hardware.

This is difficult to know.  Over many years, we've had a couple of
failed motherboards at the house, which was enough to make me at least
wonder about PG&E supply glitches.  The West Coast and especially the
suburbs just aren't known for them (not being Long Island or an SSF
industrial park), but who knows?

Any 2U rackmount hardware (including PIII-based Intel L440GX+
motherboards) that's seen a lot of use for 14 years is
probably starting to be fragile, so having a couple of 2001 motherboards
fail in quick succession in 2015 is not _necessarily_ a suspicious
coincidence.

I actually bought a standalone AC power regulation unit.  Later, while I
was at the De Anza College Electronics Flea Market and misremembering
what I'd bought earlier as a UPS, I bought an even better one.  But I'd
not yet deployed either of them, for several reasons.  One is that I
try to follow the principle of not changing more than one thing at a
time, and of not disturbing something that's working.  Another is that,
at the time, I greatly lacked the time and stamina to deal with any
sudden system failure that might result from bringing the server down
to rearrange things.  The KISS principle also suggests not adding yet
another component that could itself be a single point of failure (SPoF)
unless you're quite certain it's needed.  And I get tired of being
peppered with calls to my cellphone (as often happens) from people
saying 'Did you know that your server is down?' while I'm trying to
concentrate on critical work.

In any event, this past Saturday, I finally did deploy the voltage
regulator unit between the server and PG&E.  I also replaced a failed
hard drive that had left the server's RAID1 array in degraded (missing
redundancy) mode for quite some time.  That was the first time I'd had
to deal with such a failure, having in my professional life dealt only
with expensive hardware RAID.  I have to say:  sfdisk and mdadm really
rock!  (I'd never had occasion to use the former at all, and the latter
I'd used only to assemble the mirror pair and to monitor it.)
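
For anyone curious, replacing a failed member of a Linux software RAID1
mirror boils down to something like the following (device and array
names are purely illustrative -- check /proc/mdstat for the real ones
before typing anything):

    # Clone the partition table from the surviving drive onto the
    # blank replacement:
    sfdisk -d /dev/sda | sfdisk /dev/sdb

    # Add the fresh partition back into the degraded array:
    mdadm /dev/md0 --add /dev/sdb1

    # Watch the resync until redundancy is restored:
    cat /proc/mdstat
    mdadm --detail /dev/md0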


Also, and I offer this advice to anyone who wants computing gear to
last:  A lot of it is preventing heat buildup.  Heat buildup kills
electronics.  It might even be _the_ biggest killer of electronics.  
Even if the gear doesn't die right away, the circuitry gets stressed
and will fail a lot sooner than it otherwise would.

Having been an employee of VA Linux Systems in the late '90s and 2000s, 
I was really bummed out when I realised that VA Linux's server boxes, 
_many_ of which I had as leftovers -- and also their workstation boxes,
of which I kept one as a curio -- used junk-quality case fans.

Cruddy fans will last for a few years, though they'll be noisier than
they need to be.  The key area of parts compromise is the bearings:
the junk ones have 'sleeve bearings', which build up friction and then
seize up solid.  Once frozen, they no longer move air, but the fan
motor itself _also_ continues to generate heat, so at that point a
bad-quality fan is actually worse than no fan at all.

1U and 2U rackmount servers need to have tiny case fans that blow air 
through holes in the front and/or back.  They have to make up for being
tiny by rotating very rapidly, which means that bearing quality becomes
all the more important.

Around 2007, I suddenly noticed how terrible the case fans on my VA
Linux Systems model 2230 were.  After shaking my head in disappointment
over my ex-employer cutting corners just to save 50 cents on a specialty
server box, I hied myself down to Central Computers in Santa Clara and
bought _good_ replacement fans -- Antec brand, I think.  Replacing them
was stupid easy.  As a side-benefit, the system became significantly
quieter, too.

> I'd prepared Debian CD which seemed to best correlate to the
> version/state of most of the Debian installed on the host (before the
> install/upgrade stuff when the problem occurred - that involved a
> comparatively small number of packages compared to all the packages
> installed on the host), at least in terms of seeming a most likely
> best useful Debian CD for doing recovery (not sure if the optical
> supported DVD, 

The TEAC ATAPI drive on a 2001-era VA Linux Systems server was pre-DVD.
;->  However, if we'd needed to boot from DVD, I had a few CD/DVD (and
one CD/DVD/Blu-Ray) ATAPI drives sitting around we could have cabled in,
easily.

Those are the sort of on-the-fly kludges you can do easily with a
typical workstation or server box.  Not so much with laptops, and I'm
unsure how hardware-adaptable current all-in-one units are.

> connected USB with the prepared files

Notice how wretched USB 1.1 transfer speed is, when you're spoiled by 
better things?  That's one thing about the 2001-era motherboards I'll be
glad to see the back of, when I migrate off the current system in the
relatively near future.

One of the _nice_ things about PIII-based motherboards, and one of the
reasons I always hesitated to deploy one of the P4-based Dell PowerEdge
rackmount servers I have as spares, is that PIII systems didn't suck and
waste electrical power.  P4 and related systems were power gluttons by
comparison.  Intel didn't get serious about CPU power wastage until
customers pushed back, having found that a rack full of 1U servers was
trying to draw more power than the rack's 48-ampere feed could supply.

These PIII-based systems did max out at 2GB RAM, so VM-based setups were
not in the cards, but (leaving aside accumulated stress to 14-year-old
parts) they're more than adequate to run a home Web / mail / mailing
list / ssh / DNS / rsync / ftp server on an ADSL line.

> I'd put those onto a large capacity drive I brought 

And how.  Kids, one of the reasons to never get stuck in nostalgia about
computing gear is that improvements over time can make huge differences.
My server had the best three SCSI drives I was able to scrounge together
in 2003:  a 73GB boot drive and a mirror pair of 18GB drives to hold all
of the important filesystems (/var/www, /home, etc.).  Plunking down
alongside that a really cheap modern ATAPI drive that gets you 500GB or
more for peanuts puts the change in perspective.  (Of course, this
came as no surprise, if only because I use cheap 2TB external USB
drives for backup.)
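
If anyone wants a starting point for that sort of external-drive
backup, the general idea is nothing fancier than rsync onto the mounted
drive -- the paths and options below are just an example, not a
prescription:

    # Mount the external drive, mirror the filesystems that matter,
    # then unmount:
    mount /dev/sdc1 /mnt/backup
    rsync -aHAX --delete /etc /home /var/www /mnt/backup/linuxmafia/
    umount /mnt/backup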


> One of the great advantages of the virtual (in addition to me being
> able to work on it at my convenience), and along with large capacity
> hard drive, I would preserve original image same as it had been left
> on the physical host, would copy that and use the copy on virtual
> machine, and could work in various attempts to get everything working
> again, and I could always conveniently go back to and create fresh
> copy of the original data, reapply the theoretical set of fixes, until
> the fix "recipe" I set up (actually a pair of scripts), would do the
> full repair against the original disk images (which were left
> unchanged on the physical until an agreed upon and tested (on virtual)
> solution was worked out.

This really was a brilliant tactic, and I was impressed.  As you say, it
means you don't have to stress out about screwing up with some
irrevocable repair command.
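
For anyone who wants to borrow the general approach, the skeleton looks
roughly like this (the filenames are made up, and qemu is merely one
way to do the virtual-machine step; Michael's exact toolchain may have
differed):

    # From rescue media on the ailing host, capture the raw disk image
    # onto the big scratch drive:
    dd if=/dev/sda of=/mnt/scratch/linuxmafia-sda.img bs=1M conv=noerror,sync

    # Keep that image pristine; experiment only on copies:
    cp /mnt/scratch/linuxmafia-sda.img /mnt/scratch/work.img

    # Boot the copy in a VM and try out repairs at leisure:
    qemu-system-x86_64 -m 1024 -drive file=/mnt/scratch/work.img,format=raw

    # Botched attempt?  Delete work.img, make a fresh copy, try again.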

As it happens, I had already triple-checked my backups and would have
been perfectly fine with the outcome if the physical server had gotten
somehow clobbered during repair, because I had an already battle-tested
alternative (which I've used in the past) of just building a new
installation, configuring services from scratch, and restoring data
files.

As a note about that:  Now that I'm no longer horribly neglecting home
computing infrastructure in favour of work, I hope to set up Puppet or
Ansible configuration management on a separate low-power host in the
back of the house, so that deploying a new linuxmafia.com machine, if
required, becomes a snap -- and also so that configuration state is
carefully managed and controlled.  (I already use Joey Hess's etckeeper
to version-control system configuration files, which is the low-hanging
fruit and far less complex even than Ansible.)
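
For anyone who hasn't met etckeeper, grabbing that low-hanging fruit on
a Debian-family box takes about a minute (this sketch assumes git as
the backing VCS):

    apt-get install etckeeper
    etckeeper init                         # puts /etc under version control
    etckeeper commit "Initial import of /etc"
    # Thereafter, apt hooks commit automatically around package
    # operations, and the history is an ordinary git log away:
    cd /etc && git log --stat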


> And ... this particular software/data/state repair I thought
> particularly interesting and reasonably challenging, and given what
> that host normally does for so many, a task well worth doing and
> taking on (not only to get it operational again, but to also
> demonstrate the feasibility of such software/data repair).  

I'm glad you found the challenge inspiring, enjoyed your company, and
appreciated the generous effort.




[1] Actually, we owe the word 'serendipity' (coined by Horace Walpole
in 1754) to the fairy tale 'The Three Princes of Serendip' (Serendip
being an archaic name for Ceylon or Sri Lanka).  In that story, the
titular princes were always making marvellously useful discoveries just
in time.  Thus the word.





