[sf-lug] How linuxmafia.com got back to being operational again :-)

Thu Jan 22 01:25:36 PST 2015

And yes, of course *much* thanks to Rick!  Without whose support,
hosting, equipment, etc. this would not be possible.

Anyway, I figure at least some folks will be rather (or more) curious
about some of the details on getting linuxmafia.com back in operation.
I don't want to provide *too* much detail (could get overly long, might
cover details not appropriate, etc.).  But here's a relatively
high-level overview of how it went (okay, still fair bit of detail, but
anyway ...).  I'll mostly emphasize the software/data bits, and not so
much the hardware, though both were involved.

Basic scenario (light on details; see some of Rick's earlier emails,
e.g. on the BALUG-Talk list, for more on what had occurred):
some updates/installs were being done, and while that was in progress,
serious hardware issue(s) occurred, apparently due to a line power issue
and apparently leading to failed hardware.

At least the point at which I picked it up ... had gotten some more
detailed information from Rick - most notably relevant parts of a
session log from the updates/installs that were in progress when the
hardware issue occurred - and in fact covering from an (apparently)
known good state before the start of those updates/installs, through to
apparent hardware failure, and also at least basic descriptions of
observed host behavior after that and if I recall correctly, some bit of
evaluation of status of disk data after that (filesystem integrity
apparently okay, but not able to boot from disk).

I coordinated with Rick for on-site visit to assist.  First order of
business was getting hardware working "well enough" to proceed to
software/data issues.  This took a bit longer than expected, apparently
complicated by (gu)esstimates of the state/condition of the most relevant
hardware and spare hardware being a bit off, and there were
some intermittent issues that didn't immediately catch our attention -
so took us a bit longer to get a good (or good enough) hardware
combination together to make it viable to proceed onto software/data
bits.

Software/data bits.  In advance, I'd analyzed the session log
information Rick had provided me.  I used that to construct an at least
theoretical package rollback scheme to get packages and versions back to
the earlier "known good state".  Host is a Debian host (not secret, it
also so claims on some of the mailman pages).  Part of my preparation on
getting back to "known good state" also involved assembling the relevant
packages and versions.  I've quite a number of ISO images :-)
http://www.wiki.balug.org/wiki/doku.php?id=balug:cds_and_images_etc
... including lots of Debian, and especially non-ancient Debian.
I first searched all my Debian ISOs for any of the relevant packages and
versions (along with also checking my /var/cache/apt/archives).  That
covered many, but not all of the packages (I wanted to get not only
packages that were installed or partially installed or attempted to have
been installed, but also the versions before the install/upgrade or
attempt(s) thereof, so if/as needed I could potentially roll any given
package not only back, but also forward).  For the rest, thank
goodness it was Debian :-)
(or Ubuntu might've been about as easy).  Debian has the most excellent:
http://snapshot.debian.org/
Which in this case was highly useful for being able to get exact
versions of specific binary packages - e.g. if something came out as a
security update or bug fix release, but was superseded before it made it
to a CD ISO point release, I could still get the exact binary package
version I was looking for from http://snapshot.debian.org/
Note that *many* distributions don't have these types of services and
archiving ... though some do, or may partially cover that (e.g. some
at least keep older ISO images around - Debian also keeps the jigdo
files around to be able to reconstruct older Debian ISO images).
So, I'd gathered up the relevant packages, and also constructed an at
least theoretical "roll back" procedure to get to the earlier "known
good state".
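
A minimal sketch of the "search the ISOs and the apt cache" step, assuming
the ISO images are already loop-mounted somewhere.  The function name and
the mount points in the usage line are hypothetical; the only facts relied
on are that Debian pool files are named name_version_arch.deb without any
epoch, and that apt's cache writes the epoch colon as %3a.

```shell
#!/bin/sh
# find_deb NAME VERSION DIR...
# Print paths of .deb files for the given package name and version found
# under the given directories (e.g. mounted ISO images, an apt cache).
# find_deb is a hypothetical helper, not a Debian tool.
find_deb() {
    name=$1
    version=$2
    shift 2
    # Strip a leading epoch such as "2:" if present; pool filenames omit it.
    bare=${version#*:}
    # Match either the pool form or the apt-cache form (epoch as N%3a).
    find "$@" -type f \( \
        -name "${name}_${bare}_*.deb" -o \
        -name "${name}_*%3a${bare}_*.deb" \
    \) 2>/dev/null
}
```

Usage would be along the lines of
find_deb somepackage 1.2-3 /mnt/iso1 /mnt/iso2 /var/cache/apt/archives
with anything it fails to find being a candidate for
http://snapshot.debian.org/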

So, then, initial attempt to get back to "known good state".  I'd
prepared Debian CD which seemed to best correlate to the version/state
of most of the Debian installed on the host (before the install/upgrade
stuff when the problem occurred - that involved a comparatively small
number of packages compared to all the packages installed on the host),
at least in terms of seeming a most likely best useful Debian CD for
doing recovery (not sure if the optical drive supported DVD, so I prepared
CD as more probable to be supported).  I'd also inquired in advance about
USB and boot support thereof, and it seemed that (at least boot)
wouldn't work on the hardware's USB.

Proceeded to boot from CD, mounted the regular host's filesystems
under /target (following Debian convention), connected USB with the
prepared files, made a location on target host to store them and
transferred them there, did a chroot(8) into /target:
# chroot /target
And then worked the theoretical "roll back" procedure.
First glitches, needed some additional package versions I'd not already
downloaded ... so, downloaded, placed on USB flash, move that over to
target host, copy, continue ... after bumping into that more than once,
worked to get our target hardware system networked, to ease transfer
(starting with Ethernet crossover cable to my laptop).  Got the
"missing" needed packages, and continued ... that went fairly well
along down that path, package by package ... until it didn't.  One of
the more critical packages being downgraded (downgrades are always a
bit risky and officially not supported), led to the chroot environment
being broken.  Couldn't continue (quite) as planned at that point.
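
For the curious, the general shape of the mount/chroot/downgrade steps
above can be sketched as below.  This is a dry-run sketch only, not the
procedure actually used: the device name, the directory holding the .deb
files, and the bind mounts of /proc, /sys and /dev are all assumptions
(the bind mounts are commonly needed for package tools inside a chroot),
and any further filesystems such as a separate /usr or /var would need
mounting under /target too.

```shell
#!/bin/sh
# Dry-run sketch: by default every command is only printed.
# Set RUN= (empty) in the environment to really execute, as root.
RUN=${RUN-echo}
ROOT_DEV=${ROOT_DEV:-/dev/sda1}      # hypothetical root filesystem device
DEB_DIR=${DEB_DIR:-/root/rollback}   # hypothetical path, as seen inside the chroot

# Mount the installed system under /target and make /proc, /sys, /dev visible.
enter_target() {
    $RUN mkdir -p /target
    $RUN mount "$ROOT_DEV" /target
    for fs in proc sys dev; do
        $RUN mount --bind "/$fs" "/target/$fs"
    done
}

# Install one specific .deb inside the chroot; dpkg -i will downgrade
# (with a warning) when the file is older than the installed version.
rollback_one() {
    $RUN chroot /target dpkg -i "$DEB_DIR/$1"
}
```

Calling enter_target and then rollback_one with a .deb file name prints the
commands in order, which is a cheap way to review a rollback sequence
before running any of it.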

Next phase and considerations/rationale.  Were it much more modern
hardware with generous RAM, I might've done something like put Knoppix
DVD image on USB flash, boot from that loading image to RAM to run from
there, and use that as repair/recovery environment.  Well, we didn't
have that kind of luxury on this hardware.  Also, that would still have
the disadvantage that the image running in RAM would be volatile - no
persistent storage (though we could also store things to the target
host's disk(s)).  So, for next phase, I planned as follows.  The target
host hardware had both SCSI (two SCSI buses if I recall correctly, but
only one being used), and IDE/(P)ATA, and normally it would just have
SCSI hard drives, but it could also boot from IDE/(P)ATA drive if
installed.  So, ... back on my laptop, I used virtual machine
capabilities, to build a virtual machine, using (sparse) flat file as a
raw hda(IDE/(P)ATA) drive image - and I sized that logically to
precisely match the size of the smallest capacity such drive I had which
was "big enough" to do what was needed.  After successfully creating the
virtual machine, I shut it down, blasted that raw hda image to physical
drive (via USB to IDE/(P)ATA/SATA adapter), reconfigured the virtual
machine to use that physical drive (to test it out, etc.), and then for
the next on-site visit, we installed that drive onto the target host
hardware (connected it in place of the optical drive).  Made some fair
bits of progress on the physical target host, however it was a somewhat
slow divide-and-conquer process.  A bit further into that process (and
something I may have thought of a bit earlier), to be able to both
continue working on it more conveniently, and also to be able to much
better roll-back/roll-forward, we essentially came up with the following
approach, with Rick's approval:
I'd suck the target host's filesystem images (at least all the relevant
filesystems - omitted two filesystems not relevant to fixing the
problem issues/areas or potentially involved), and also nominal boot
drive relevant data (grabbed everything before the first partition).
I'd also grab the target host's hard disk partitioning data and the
information about the precise size of the drives (and also the SCSI ID
information).
I'd put those onto a large capacity drive I brought - and with all the
data encrypted on drive (LVM atop LUKS/dm-crypt).  I'd then use that to,
at my relative leisure, set up virtual machine, work out all that
needed to be done with the software/data to fix the issue(s) with the
software/data, then we'd apply that "fix" to the actual physical
machine.  Transfer was done via an Ethernet crossover cable from the
target host, through my laptop and onto large capacity hard drive.  Rick
and I also set up the target host, still running on the IDE/(P)ATA drive
I'd brought that we booted it from, back on The Internet using the
linuxmafia.com IP address, with ssh running, and with a secure temporary
root password known to just Rick and myself.  From there if I had any
need to, I could also inspect the physical system, and could even do
much of the repair work remotely, once the "recipe" needed to fix
everything was worked out.
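
A minimal sketch of grabbing "everything before the first partition",
assuming a partition dump saved earlier with sfdisk -uS -d and 512-byte
sectors.  The helper names are hypothetical, and the parsing only relies on
the "start= N, size= M" fields of the dump.

```shell
#!/bin/sh
# first_start DUMPFILE
# Print the lowest start sector among non-empty partitions listed in a
# saved "sfdisk -uS -d" dump.
first_start() {
    sed -n 's/.*start= *\([0-9][0-9]*\), *size= *\([0-9][0-9]*\).*/\1 \2/p' "$1" |
        awk '$2+0 > 0 && (!seen || $1+0 < min) { min = $1+0; seen = 1 }
             END { if (seen) print min }'
}

# save_pre DISK DUMPFILE OUT
# Copy the sectors preceding the first partition (MBR, boot loader bits, etc.)
# from DISK to OUT.  Assumes 512-byte sectors.
save_pre() {
    dd if="$1" of="$3" bs=512 count="$(first_start "$2")" 2>/dev/null
}
```

On a real host that would be something like sfdisk -uS -d /dev/sda >
sda.sfdisk followed by save_pre /dev/sda sda.sfdisk sda.pre.img, with each
relevant filesystem then imaged separately with dd (device and file names
here are only illustrative).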

So ... work continued after that, on virtual - but highly similar to the
physical.  One of the great advantages of the virtual (in addition to me
being able to work on it at my convenience), along with the large
capacity hard drive: I would preserve the original image same as it had
been left on the physical host, would copy that and use the copy on the
virtual machine, and could work in various attempts to get everything
working again.  I could always conveniently go back and create a fresh
copy of the original data and reapply the theoretical set of fixes, until
the fix "recipe" I set up (actually a pair of scripts) would do the full
repair against the original disk images (which were left unchanged on
the physical until an agreed upon and tested (on virtual) solution was
worked out).  There was a lot of divide-and-conquer on the virtual,
isolating exactly what needed to be done, and in what sequence.  But
after some fair bit of time (I typically nibbled at the issue 20 minutes
to 2 hours at a time, at my relative leisure), I eventually came up with
well tested scripts on the virtual, ran them by Rick, and then with his
review/approval, ran them on the physical.  After that, it was mostly a
matter of Rick rejiggering the hardware slightly (remove the IDE/(P)ATA
drive, reconnect optical, eject optical tray or disk), boot, and all
should then be good again.  That's essentially what we did.  And the
"known good state" we repaired it to was nearly identical, in packages
and versions, to that before the software installs/upgrades
and hardware failure.  There were a (very) few packages that needed to
be updated/installed beyond that to get everything working again.  They
may or may not have been needed before in the earlier "known good
state", but after all the mucking about by both of us ;-) and repair
attempts, methodology, state, etc., a few package updates/installs were
needed to get everything happy (package states all good and consistent)
and host properly bootable.

And that's the short version.  ;-)  Most o' the rest is details (and
there are lots of those).

Teensy bit more detail/commentary ;-) ...

One of the things I *really like* ("love" :-)) about Linux (and Unix),
is that it is quite doable (though not necessarily trivial) to be able
to dig, if and as needed (or desired), down as deep as one needs to go,
to figure out what's going on, or going wrong, and, e.g. find and fix
what's broken.  With certain common highly closed proprietary operating
systems, that just isn't possible like that - at least outside of said
company itself (if they even bother) - one hits a brick wall where even
their highest levels of very expensive support just won't tell you - and
there really isn't a feasible way to determine it.  Not the case with
Linux - one can go as deep as needed.  And similarly with Unix (or any
proprietary bits on Linux), one can always isolate down to something
highly localized - in "worst case" one can get it down to a very
specific closed bit - what goes in, and out, what it does and doesn't
do - as expected or needed or not.

And ... this particular software/data/state repair I thought
particularly interesting and reasonably challenging, and given what that
host normally does for so many, a task well worth doing and taking on
(not only to get it operational again, but to also demonstrate the
feasibility of such software/data repair).  And by also repairing in
this manner, avoid all the work needed to do a rebuild/merge etc. type
of recovery.  If the backup scheme had been a bit different (e.g.
"backup everything"), there may have been other recovery options
available (and there were also other viable recovery options available,
and quite adequate backups).  But many environments, often very
appropriately don't "backup everything" - as much of that is redundant
and/or quite easily reconstructed (e.g. if you have 1,000 hosts running
the same version of the same operating system, how many copies of
identical binaries would you want?).  Anyway, this recovery/repair
operation was mostly optimized to get it working again quite as it was
before, avoiding all the rebuild/reconfigure/merge issues, and also as a
fun (hey, at least for me! :-)) exercise in working through fixing a
"broken" Linux installation.  That however, doesn't necessarily mean all
broken Linux installations are that easy (or feasible) to recover from.
E.g. if someone has a whole lot of mixed distribution repositories
from different Linux distributions and 3rd parties, various random
stuff grabbed and compiled from source and thrown on - especially of
dubious and/or probable to be conflicting sources, and little to
nothing in the way of backups and/or relevant log information, I'd
typically look at something like that and say "no thanks" (that doesn't
mean such would be impossible - just would be much messier, and much
more difficult to even know what the end target should be).  So, ...
this one was a good one to fix :-) - good backups, good logged* stuff,
rather clean on the distribution stuff as far as what software from
where, excellent support from Debian itself (notably
http://snapshot.debian.org/ for this scenario) - and of course useful
host/service to many, once operational again.

*Logs - I'm sure Rick has made the point many times, and he and I did
also briefly touch upon it.
Not just the stuff the host logs, but "keep notes" - or the equivalent.
E.g. on paper or in a system log book, or some other form(s) that will
be readily accessible and usable when the host is down or if the data
on its local storage is toast.  As for myself, years ago and for many
years, I did hardcopy log - some Unix vendors would even give one a log
binder, well labeled, with divided sections and blank (or blank form)
pages in them to help encourage such practice.  I eventually migrated to
scheme where such log information goes to flat files, and those are
regularly backed up - and in such manner that can be retrieved even if -
and especially when, host totally fails (if one has missed that key
point, one has overlooked a key need of such logs - to have 'em usable
when the host is toast).  We also do something quite similar for SF-LUG
and BALUG - their virtual machines have such logs - which are regularly
backed up, and also, for the curious, most of that log information is
publicly accessible:
http://www.sf-lug.org/log.txt
http://www.archive.balug.org/log.txt
and going back a bit earlier, there's:
http://www.wiki.balug.org/wiki/doku.php?id=system:change_log
Also, in case folks are curious, I wrote a Perl program that can take much
of the output of apt-get or aptitude, from
installs/upgrades/removals/purges, and turn that into a much more
concise human readable summary form.  Not too long ago I also altered
that program so it can also parse dpkg log output (e.g.
/var/log/dpkg.log) and summarize it in quite the same concise
human readable form.

On figuring out the "known good state" regarding packages and versions
possibly slightly more accurately (and/or easily): I made use of
inspection of the /var/log/dpkg.log* files.
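
A minimal sketch of that kind of dpkg.log inspection: list each package
installed or upgraded since a given time, with the version it had
beforehand.  This is not the Perl program mentioned above; the function
name is hypothetical, and it assumes uncompressed logs with the usual
"date time action package old-version new-version" lines (rotated .gz logs
would need decompressing first).

```shell
#!/bin/sh
# changed_since "YYYY-MM-DD HH:MM:SS" LOGFILE...
# For each package installed or upgraded at or after the given time, print
# "package version-before" once, using the earliest such log entry.
# A version-before of "<none>" means the package was not installed before.
changed_since() {
    since=$1
    shift
    # Timestamps sort correctly as plain strings, so sorting merges the logs.
    LC_ALL=C sort "$@" | awk -v since="$since" '
        ($1 " " $2) >= since && ($3 == "install" || $3 == "upgrade") && !($4 in seen) {
            seen[$4] = 1
            print $4, $5
        }'
}
```

The output is effectively the list of packages and versions to roll back
to, e.g. from changed_since "2015-01-10 00:00:00" /var/log/dpkg.log (the
date here is only illustrative).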

On preparing the virtual disk images, used sfdisk -uS -d to save the
partition data earlier, and likewise, sfdisk -uS to recreate identical
partitioning on virtual drive images.  So, the data on the virtual
drives were identical - except they omitted the two filesystems not
covered, and anything outside of the partitions - except I did also save
the data before the first partition on the boot drive, and did also
write that to start of that virtual disk image.  There was also md
involved, so reproduced that highly similarly, based on target host data
(matching UUIDs and md format version number, and underlying storage on
corresponding virtual partitions).  Note also one wants to be careful
with UUIDs - theoretically they should *always* be unique, and *not*
duplicated.  One certainly doesn't want two matching UUIDs on the same
host - that's generally quite asking for trouble.
Use of:
# losetup -f --show some_image_file
# partx -a some_loopback_device_of_partitioned_disk_image
# dd if=/dev/zero bs=some_fair_size_for_the_physical_media \
> of=/some/temporary/file/on/the/filesystem
(and then once dd completes, remove that temporary file)
# cp --sparse=always
were among the additional bits very handy for setting up the
virtual disk images working from filesystem images (and image of the
data preceding first partition on the SCSI boot disk), and also holding
down (or at least shrinking back) physical storage space needs while
working with these virtual images.
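
A minimal sketch of the sparse-file side of this, runnable as an ordinary
user on plain files.  The helper names are hypothetical and GNU coreutils
is assumed (truncate, cp --sparse, stat -c).  The idea, as far as can be
read from the commands above, is that zero-filling a filesystem's free
space and then removing the fill file leaves long runs of zeros in the
image, which cp --sparse=always can turn back into holes.

```shell
#!/bin/sh
# make_sparse_image FILE BYTES
# Create (or resize) FILE to exactly BYTES bytes without allocating blocks,
# e.g. to match the exact size of a physical drive.
make_sparse_image() {
    truncate -s "$2" "$1"
}

# resparse SRC DST
# Copy SRC to DST, writing runs of zero bytes as holes.
resparse() {
    cp --sparse=always "$1" "$2"
}

# usage_report FILE
# Show apparent size in bytes next to the space actually allocated (in KiB).
usage_report() {
    echo "apparent=$(stat -c %s "$1") allocated_kb=$(du -k "$1" | cut -f1)"
}
```

Comparing usage_report before and after resparse shows how much physical
space a given image copy really takes, while the apparent size (and so the
virtual drive size) stays the same.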

divide-and-conquer
http://en.wikipedia.org/wiki/Divide_and_conquer_algorithms

jigdo
http://en.wikipedia.org/wiki/Jigdo

There were enough "pieces" to this "puzzle", that some of the logical
isolating of what was/wasn't needed to do the fix or certain portions
thereof (or how to progress towards fix) I even did programmatically.
E.g. in pseudo-code, something like:
  while more elements in set to try
    apply next element
    test if it's still broken; if it works, report works and last bit
    done, and exit loop
  for each element in the set applied that resulted in working
    remove element
    retest; if still working, try next in loop, else
    if broke after removal, report what was removed, that it broke, and
    exit loop
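
The outline above can be sketched as runnable shell along these lines.
This is only an illustration: works() is a hypothetical stand-in for
"apply these fixes to a fresh copy of the images and see whether the
virtual machine comes up", and unlike the outline, this variant puts an
element back and keeps going when removing it breaks things, rather than
stopping there.

```shell
#!/bin/sh
# works "SPACE SEPARATED SET"
# Hypothetical test: here it merely pretends fixes "b" and "d" are both needed.
works() {
    case " $1 " in *" b "*) ;; *) return 1 ;; esac
    case " $1 " in *" d "*) ;; *) return 1 ;; esac
}

# minimize CANDIDATE...
# Print a reduced set of candidates that still passes works().
minimize() {
    applied=
    found=no
    # Phase 1: add candidates one at a time until the result works.
    for e in "$@"; do
        applied="${applied:+$applied }$e"
        if works "$applied"; then found=yes; break; fi
    done
    [ "$found" = yes ] || { echo "still broken with: $applied" >&2; return 1; }
    # Phase 2: try dropping each element; leave it out if things still work.
    for e in $applied; do
        trial=
        for k in $applied; do
            [ "$k" = "$e" ] || trial="${trial:+$trial }$k"
        done
        if works "$trial"; then applied=$trial; fi
    done
    echo "$applied"
}
```

With the stand-in test above, minimize a b c d e prints "b d".  In real use
each call to works() would start from a fresh copy of the original images,
which is what makes keeping those originals untouched so valuable.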

Ah, and I should've noticed Rick's email to the list a wee bit earlier,
... but I didn't.  For better or worse (actually some of both) I
typically have digest mode set on for the SF-LUG list (and still had and
have it set that way).

references:
http://linuxmafia.com/pipermail/sf-lug/2015q1/010647.html