diagnose/fix boot/software issue: (e.g.) linuxmafia.com host

Michael Paoli Michael.Paoli at cal.berkeley.edu
Sun Jan 4 00:17:26 PST 2015


Rick,

Just a thought, if it might be useful/helpful.

Lots of "if"s, including the above ;-) but ...

I was also thinking ...

if most or all of the problems to be resolved with the host system that
experienced the failure and is (presumably) still having boot issues
("bizarre GRUB errors" ...) are software and (at least mostly) not
hardware issues ...

I was thinking we may be able to effectively "crowd source" much of the
work of diagnosing the problem and finding the relevant fix/correction.
Namely, I was thinking ...

you could provide the relevant data, and folks could examine it on a
virtual (or separate physical) machine and find fix/solution(s) to the
issue.

Bits we'd need, or that would likely be quite helpful, to analyze and
find a solution:

Actual data - since I'm presuming it's a "boot" issue still, failure
occurs prior to kernel successfully loading.  Presuming it's software,
that does relatively narrow down scope of data to something not too
huge.  Could upload and make publicly available (presuming no data bits
within contraindicating such a move):
- detailed low-level partition information, e.g.:
    # sfdisk -uS -d /dev/sda
  (presuming legacy formatting, and sda)
- all the data from start of disk up to (but not including) first
  filesystem/partition on disk, or a few MiB of such data, whichever is
  less.  Most notably that would include MBR, and any other bits GRUB
  might squirrel away on disk there.
- "boot" filesystem.  If /boot is a separate filesystem, complete image
  of that filesystem, e.g.:
    # dd if=/dev/sda1 | bzip2 -9 > boot.fs.bz2
  If /boot is not a separate filesystem, but is on the root (/)
  filesystem, then instead, provide:
  - full backup of /boot contents (e.g. pax, cpio, or tar archive of
    contents)
  and also:
  - the first few MiB or so of the raw filesystem image (notably to get
    any bits GRUB sticks on there in "reserved" areas).
- dump of relevant information for that filesystem, e.g. if it's an
  ext[234] filesystem:
    # dumpe2fs /dev/sda1
  Or similar details if it's some other filesystem type.
- also from the root (/) filesystem, any other relevant GRUB
  configuration bits, e.g. often found somewhere under /etc - can tar up
  and provide the relevant file(s) covering that.
- also, if md or LVM is used for any of the filesystems mentioned above,
  the relevant md/LVM information.  If hardware RAID is involved,
  probably don't need that information, but just need what the
  OS/software logically sees of the drive(s).

Also would be potentially highly helpful:

- OS distribution and version that was being upgraded from when issue
  occurred
- OS distribution and version that was being upgraded to when issue
  occurred
- likewise, GRUB version, and version being upgraded from, and to, when
  issue occurred.
  If some of that version information might not be fully known,
  reasonable approximations (and indications of such) would still be
  quite useful, e.g. on/about YYYY-MM-DD was upgrading from
  <distribution> <version> to the then most current version (or version
  <version>), the to/from GRUB versions would be those applicable for
  the <distribution> versions going from and to.
- some hardware information might also be helpful - probably don't need
  to be too detailed, but most useful I'd think would be size of host
  RAM, CPU type/family (e.g. Intel 64-bit or 32-bit), and drive
  controller type/interface (IDE/PATA/SATA/SCSI/...).
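
For the "start of disk up to the first partition" item above, a minimal
sketch might look like the following.  It assumes 512-byte sectors and
that the start sector of the first partition gets read off the sfdisk
dump by hand; the function name, the 4 MiB cap, and the output file name
are made up here for illustration, and nothing below is known about the
actual host.

```shell
#!/bin/sh
# Sketch only: copy the region before the first partition (MBR plus any
# data GRUB may have embedded there), capped at a few MiB.
# Assumptions: 512-byte sectors; the start sector is taken by hand from
# the output of "sfdisk -uS -d".  All names here are hypothetical.

save_pre_partition_area() {
    disk=$1          # e.g. /dev/sda, or a disk image file
    first_start=$2   # start sector of the first partition
    out=$3           # file to write the copy to

    cap=$((4 * 1024 * 1024 / 512))   # 4 MiB expressed in 512-byte sectors
    count=$first_start
    if [ "$count" -gt "$cap" ]; then
        count=$cap                   # "whichever is less"
    fi

    # dd only reads from $disk here; nothing is written to it.
    dd if="$disk" of="$out" bs=512 count="$count" 2>/dev/null
}
```

E.g., run as root, for a legacy layout with the first partition starting
at sector 63:
    # save_pre_partition_area /dev/sda 63 pre-partition.img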

Anyway, I was thinking, if you're able to pull that data off the drive
and upload it somewhere for us, we might well be able to figure out the
boot issue and corrective measures - and that may involve fewer total
person-hours in a cold garage working to determine a fix for the issue.

Not at all that you have to :-) ... but I was thinking it might possibly
get to "fix"(ed) sooner and easier that way ... and if nothing else,
thought it may be useful to illustrate to folks that such an approach can
be used to diagnose issues and test out fixes to a software issue (at
least if the issue doesn't have specific hardware dependencies).

Also, in any case, having all that backed up can also allow one to
return to that state, if no changes beyond that data are made.  (Do have
to be rather careful though, with "reserved bits" written in reserved
area of filesystem, outside of (before) partitions, etc.  I'm not
spelling out all the details on that here, though).
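
As one hedged illustration of the "return to that state" idea, here is a
sketch that writes a saved region back onto a scratch disk image and
checks that it reads back identically.  It assumes the saved file is a
whole number of 512-byte sectors and that the target is a scratch copy
(or a disk in a test VM), not the original drive; the function name is
hypothetical, and it does not cover the reserved areas inside
filesystems.

```shell
#!/bin/sh
# Sketch only.  Assumptions: the saved image is a whole number of
# 512-byte sectors, and "target" is a scratch disk image (or a disk in a
# test VM), NOT the original drive.  The function name is hypothetical.

restore_and_verify() {
    image=$1     # saved data, e.g. the region before the first partition
    target=$2    # scratch disk image to write it back onto

    size=$(wc -c < "$image" | tr -d ' ')
    sectors=$((size / 512))

    # conv=notrunc: overwrite the start of the target without truncating
    # the rest of it (matters when the target is a regular file).
    dd if="$image" of="$target" bs=512 count="$sectors" conv=notrunc \
        2>/dev/null || return 1

    # Read the same region back and compare it byte for byte.
    dd if="$target" bs=512 count="$sectors" 2>/dev/null | cmp -s - "$image"
}
```

The function returns 0 only if the region read back from the target
matches the saved image.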

> From: "Rick Moen" <rick at deirdre.net>
> Subject: Re: It's a gift (not a newsletter) ; and an offer from SF-LUG
> Date: Tue, 30 Dec 2014 13:56:21 -0800

> On Tue, Dec 30, 2014 at 1:14 PM, jim <jim at well.com> wrote:
>
>> A couple of meetings ago, a few SF-LUG folks agreed to
>> purchase some old box in good working order and with
>> sufficient resources to host a MailMan system. Rick, if
>> this offer will help you, please let us know: we're willing
>> to find, vet, purchase, and deliver. I'm interested in
>> seeing if I can provide an electrical processing system
>> that can protect your machines from over- and under-
>> voltage mishaps.
>>
>
> Hey, thanks to all of you for the lovely and thoughtful offer.
>
> Thing is, I actually do have a bunch of hardware sitting in my garage.  At
> least one of them is very likely a functional 1U or 2U rackmount server,
> which is the right sort of thing to use.  (Many desktop boxes have things
> about them that make them unsuitable, such as many desktop machines' ATX
> power supplies not being able to be configured to bring the machine back up
> without manual intervention when the power returns after a power outage.)
>
> Just before I went on my last vacation, I moved the hard drives from my
> server from the failed VA Linux Systems model 2230 to a spare model 2230.
> To my relief, I got video and was able to boot an Aptosid live CD.  Even
> better, I was able to mount my server system's partitions, verified that
> they were readable, and update my backups of everything.  Thus, at that
> point, I was no longer in danger of having to revert to an old backup.
>
> Using the live CD, I then attempted to fix the software problems that were
> the _other_ issue aside from failed hardware.  (To recap, I had been doing
> system updates, and (skipping some details) the system segfaulted in the
> middle of the system software upgrade. I cold booted, but there was from
> that point forward no video at all, nor beeps, i.e., it acted as if I'd had
> failure of the motherboard or other key system hardware.)   I was not able
> to find a way to make the system bootable through some hours of
> experimentation - was getting some bizarre GRUB errors - and had to defer
> the matter because I had to leave to catch our flight to Barbados.  So, I
> powered down the machine.
>
> When I got back from Barbados, I found something perplexing:  I heard the
> system fan running, and saw the blue power light on the front panel, i.e.,
> it was powered up (even though I'd left the system powered down).  However,
> despite that, there was no video.  Cold booting the system resulted in...
> no video.  This was really bizarre.  The symptom suggested that there had
> been a power outage during my time in the Caribbean, and upon the return of
> power, my system had come online (I hadn't unplugged it, just powered it
> down), and that there had then been a second and similar hardware failure.
> But this seemed like an implausible coincidence, as perhaps you would agree.
>
> Time and experimentation and use of careful logic can get to the bottom of
> the matter.  I just haven't lately had the patience to do that, and have
> been quite busy with other commitments in the meantime.  Sooner or later, I
> _do_ plan on sitting out in my very cold garage for as long as it takes.  I
> certainly could give up on debugging the VA Linux Systems gear, and just
> attempt to build from scratch a replacement software configuration on one
> of the other spare machines I have.  I'd prefer not to do that, because
> building a new server configuration instead of just tracking down the one
> software problem that made my system unbootable is a LARGE amount of extra
> work.
>
> And, thus, you'll notice, the resource I'm short on is not machines, but
> rather time, patience, and focus on the problem.
>
> About over/under-voltage:  Last year, concerned about that very thing, I
> set about dealing with that.  First thing I did was to buy an APC UPS unit
> over at Central Computer.  However, this never seemed like really the right
> solution, just the commercially easy thing to acquire:  A UPS isn't
> actually very great at dealing with power fluctuations (and sometimes is
> useless at that, depending on the type), and also interposes a new single
> point of failure in the form of a big lead-acid battery that can, itself,
> bring down your system.  Also, the UPS generates quite a bit of heat, which
> bloats your PG&E bill, and you have to buy replacement lead-acid battery
> packs every few years, which are a large percentage of the cost of the
> entire UPS, each time you have to buy them.
>
> What the UPS mostly does - the problem that it exists to solve - is bridge
> you across short-duration outages, making it so you don't lose power and
> have continuous uptime.  Continuous uptime is abstractly nice, but is the
> thing I care least about:  Linux servers come right back up after power
> returns.  That's what we have journaled filesystems for.  So, given that
> fact, why would I want to put a continually expensive, heat-producing,
> potentially problematic bit of hardware between the AC outlet and my unit,
> one that isn't even very good at line regulation, and that can be a Single
> Point of Failure that otherwise wouldn't exist?
>
> In short, I have not been in a hurry to deploy the UPS, because it's mostly
> a solution to the wrong problem, a solution to a problem I don't care about
> very much.  On reflection, I realised that the right solution is a line
> conditioner unit, not a UPS.  And I don't mean the miserable rubbish you
> can get at Fry's, either.  The problem was:  Where do you get a line
> conditioner of the variety that people acquire who are serious about the
> problem?
>
> Last summer, I solved that problem:  I went to the De Anza College
> Electronics Swap, very early in the morning, and found a vendor who was
> selling a ham-radio-grade line conditioner unit.  I have that with my gear,
> and expect to use it going forward.
>
> Thanks again.



