[sf-lug] Troubleshooting (was: memory integrity checking / excluding bad memory)

Sat Feb 2 22:09:49 PST 2008

Quoting Alex Kleider (a_kleider at yahoo.com):

> Several months ago on a Monday night at the JavaCat, Asheesh and
> Kristian were trying to install onto a laptop that turned out to have
> bad memory; the problem was eventually discovered by checking the
> integrity of the computer's RAM but if I wrote down the command used,
> it's been lost. 

I just remembered a couple of sets of lecture notes I wrote a long time
ago, that might be of interest.  Here's one:

Troubleshooting Hardware
Lecture Notes/Outline by Rick Moen

Abstract / Why This Matters:  There are times when it makes sense to
perform at least initial hardware diagnosis / triage; this lecture
suggests tools and techniques for so doing.

Table of Contents

Introduction
Overview
   "What Do We Know"
   Paths to Diagnosis
   Importance of Avoiding Inconclusive Tests
   Dealing with the Human Factor
   Replication
   Heisenbugs vs. Bohr Bugs
   Known-Good Items
   Elimination of Variables
   Coincidences, Common-mode Failures, and the Way to Bet
   "If So, Then..."
   Hearing, Smell
Common Problems
   Loose Connections
   Bad RAM
   Flaky Hard Drive
   Heat Buildup
   Marginal PSU
   Dirty or Failing CD/DVD Drive
Primary Software Tools
   Cerberus Test Control System
   Knoppix
      dmesg
      memtest86
      lspci, lsusb, dmidecode, X11 log
      /proc/meminfo, /proc/cpuinfo
      badblocks
   DOS CD, HD utilities
   BIOS Setup contents, ECC, etc.
   smartmontools
Primary Hardware Tools
   Reseating/Replugging
   Dust-Off
   Spare ATAPI CD-ROM drive and PATA Cable
   Spare PSU
   Other Obvious Tools

Introduction

Hardly anyone has diagnosis of computer hardware as part of his/her job,
and it's vital to avoid spending time on such efforts except where they
make business sense:  When in doubt, please consider carefully, before
spending significant amounts of time working through these techniques,
whether they're worth the time and effort they'll take away from your
other productive tasks.

For one thing, much computer hardware is under maintenance contracts,
either pursuant to manufacturer / vendor warranty obligations or
separate IT contracts for hardware repair.  That having been said, there
are always situations where user time spent doing at least initial
hardware diagnosis is justified anyway, e.g., to short-circuit
finger-pointing between vendors or to overcome vendor reluctance to give
us satisfactory resolutions.

Overview

A former boss of mine in the "Management Information Systems" (IT)
department at now-vanished database firm Blyth Software, Mr. David
Carroll, had a characteristic expression that he would say, to nobody in
particular, whenever we were trying to solve a technical problem: "What
do we know?"  At first, I mostly classed it as a personal quirk, but
eventually I came to understand that it was an extremely helpful tool
for shortening diagnostic time: You think: What is the known history of
this problem, what are the variables, what assumptions am I currently
making, what are the suspects, what are the known-good elements? You
repeat that discipline at intervals, to make sure you're not haring off
on wild-goose chases or wasting time doing something doomed to prove
ultimately inconclusive.

His other frequent observation on the matter was that "Any diagnostic
theory, if pursued logically enough, will eventually arrive at the right
answer." The trick, of course, was to both be sufficiently logical and
pick an efficient diagnostic path (series of hypotheses).

All of the other concepts I will outline below aim to apply those two
ideas.  (This section is of necessity purely conceptual; I'll fill in
specifics later.)

Carroll's caution about pursuing a theory logically implies carefully
avoiding inconclusive procedures:  If your answer to the "What do we
know?" question is the same before and after a procedure, then it's
wasted time.  It's important to remember this, e.g., if onlookers ask
you "Why aren't you doing [procedure foo]?", where [foo] has no
information value.

Suggestions from users and onlookers are very common and almost never
useful, since most people don't understand diagnosis and haven't
bothered to acquaint themselves with background facts of the case.  Be
prepared to summon up some polite equivalent of "Because I said so; now,
please let me work" for such situations.

Often, especially when the problem is one you yourself didn't discover,
the first and most crucial step is replication, i.e., observing the
problem symptoms for yourself.  It's a truism that technical problems
that cannot be reproduced also cannot be fixed, and technical support
departments the world over tend to have some problem-resolution status
analogous to the infamous "CNR" = Could Not Reproduce, for problems that
cannot even be observed, let alone fixed.

In that area, diagnosticians make a tongue-in-cheek distinction between
"Heisenbugs", i.e., bugs that are difficult to reproduce because they
behave in a Werner Heisenberg-like indeterminate fashion and cannot be
produced at need, and "Bohr Bugs", those that are the opposite and
manifest reliably under known conditions (like Niels Bohr's competing
model of the atom).

When you are starting a diagnostic track, you are trying to narrow down
which of various possible causes is producing the symptom.  Towards that
end, it's very useful to know, or to determine, which suspect components
are already known to be non-defective, called "known-good".  In general,
the process of diagnosis entails enumerating (in your head) all the
things that could cause the problem, and narrowing down the suspects to
the one that must be responsible.
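
The narrowing-down process, and the earlier test for whether a procedure
was inconclusive, can be sketched in a few lines of Python.  This is
only an illustration of the reasoning, not a tool; the function names
and the suspect list are made up for the example.

```python
def narrow_suspects(suspects, known_good):
    """Return the suspects that have not yet been shown to be known-good."""
    return [s for s in suspects if s not in known_good]


def was_conclusive(before, after):
    """A procedure had information value only if it shrank the suspect list,
    i.e., if the answer to "What do we know?" changed."""
    return len(after) < len(before)


# Hypothetical starting list of suspects for an unreliable machine.
suspects = ["RAM", "PSU", "hard drive", "CD/DVD drive"]

# Swapping in a known-good PSU did not cure the problem, so the PSU is cleared.
remaining = narrow_suspects(suspects, known_good={"PSU"})
print(remaining)
print(was_conclusive(suspects, remaining))
```

A procedure that leaves the list unchanged is exactly the sort of
inconclusive test worth declining.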

Writer Damon Runyon once wrote that "The race is not to the swift, nor
the battle to the strong, but that's the way to bet."  Similarly,
successful diagnosis requires a good seat-of-the-pants notion of
probability: Diagnosticians tend from experience to strongly resist any
conclusion involving coincidence, e.g., that a failure was caused by two
parts failing at the same time, unless there is a plausible candidate
cause underlying all parts of the "coincidence", e.g., you decide that
two sticks of RAM are both defective, because they both turn out to have
been pulled from a pile of untested suspect parts.

Another important conceptual trait I'll mention is predictiveness, or
"If so, then...." If you're going on a hunch that your unit's
misbehaving because of heat buildup, then you should stop and think
about how such a unit would and would not behave.  Does the problem
manifest only starting half an hour after a cold startup and never
before?  If it also shows up on a stone-cold machine, heat is probably
not the cause.
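
As a rough sketch of that "If so, then..." check: suppose you have
noted how many minutes after a cold start each failure occurred.  The
function name and the 30-minute default below are assumptions chosen to
match the example above, not measured figures.

```python
def consistent_with_heat_buildup(failure_minutes, warmup_minutes=30):
    """Return True if every observed failure happened only after the
    machine had been running for at least `warmup_minutes`.

    failure_minutes: minutes after cold start at which failures were seen.
    An empty list proves nothing, so it is treated as not consistent.
    """
    if not failure_minutes:
        return False
    return min(failure_minutes) >= warmup_minutes


print(consistent_with_heat_buildup([35, 50, 41]))  # all failures after warm-up
print(consistent_with_heat_buildup([2, 45]))       # one failure on a cold machine
```

A single failure two minutes after a cold start is enough to make the
heat hypothesis a poor bet.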

Your senses of hearing and (to a lesser extent) smell are quite
important, and mustn't be forgotten:  If you've gotten to know the
background sounds of a system when it's starting up, e.g., the
initialisation and seeking of the various hard drives, it really stands
out when those sounds are in part missing, or different.  Can you hear
the various fans starting up?  The hard drives sequentially spinning up
and seeking their respective sector zeros?  Smell is also significant in
the sense of helping you detect incipient overheating before it causes
disaster, and also noticing the moment of an electronic component's
death through electrocution or overheating:  The acrid smell a
semiconductor emits when it dies in that fashion is unforgettable, once
you've encountered it the first time.

Common Problems

1. Loose Connections

An astonishingly high percentage of hardware problems, maybe as high as
half, turn out to be curable by reseating major assemblies (RAM sticks,
cables, add-in cards).  You should be especially quick to check this if
there's any reason to suspect the unit has been recently moved or
jostled.

2. Bad RAM

Bad memory chips are increasingly a problem in computers, because of
pressure on RAM prices, higher densities, and relatively worse
fault-detection in most OSes.  Consequences of bad RAM only begin with
unreliable operation:  At worst, you can have gradual silent corruption
of stored data every time it passes through the defective RAM segment,
with the result that, once the data corruption is found, management must
choose between very old data on backup sets and more-recent but
gradually more corrupted data.

Most "memory-checking" routines in the Power-On Self Test (POST) run by
a computer's BIOS at boot time are worse than useless:  They give false
assurance of quality, routinely giving their OK to grossly defective
RAM.  The exception is motherboard-based checking of ECC RAM, where ECC
RAM is actually present and where the full checking routines are enabled
in the BIOS Setup (which is often not the default).

Signs of bad RAM include:  segmentation-faulting processes, spontaneous
reboots, kernel panic messages entered into system logfiles (which may
include "NMI: dazed and confused but struggling to continue"),
mysteriously dying processes.  Of course, there may be other causes of
these symptoms, such as NFS problems.
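
A quick way to look for those signs is to scan saved log text for
telltale phrases.  The sketch below is a minimal example in Python; the
patterns are assumptions based on the symptoms just listed, and exact
log wording varies by kernel version and distribution, so adjust them
to what your logs really say.  The sample lines are invented.

```python
import re

# Assumed patterns; real log wording differs between kernels and distros.
SUSPICIOUS_PATTERNS = [
    re.compile(r"segfault", re.IGNORECASE),
    re.compile(r"kernel panic", re.IGNORECASE),
    re.compile(r"dazed and confused", re.IGNORECASE),
]


def find_suspicious_lines(log_text):
    """Return the log lines that match any of the patterns above."""
    return [
        line
        for line in log_text.splitlines()
        if any(p.search(line) for p in SUSPICIOUS_PATTERNS)
    ]


# Invented sample text, for illustration only.
sample = """\
kernel: eth0: link up
kernel: someprocess: segfault
kernel: NMI: dazed and confused but struggling to continue
"""

for line in find_suspicious_lines(sample):
    print(line)
```

The text could come from a saved logfile or from captured dmesg output.
A hit is only a lead, since other causes produce the same messages.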

Sometimes, "shy" memory problem symptoms can be brought to the fore by
reshuffling the order of the RAM sticks in their sockets, or by testing
sticks in the minimum quantity at a time.  (Some machines need RAM
deployed in pairs, because of interleaving.)

3.  Hard Drives

Hard drives are the next most common cause of problems, but failures are
generally quite obvious.  You will get an inability to access the drive
at boot, sector seek errors, or "sense key" errors.

4.  Heat Buildup

Heat buildup is the next most common problem, and one of the most
pernicious, because it weakens all components in the affected machine
and shortens all of their lifetimes.  Note that once one hard drive in a
system fails because of system overheating, any other hard drives in it
are at very high risk.  Be certain to check all internal fans, if you
have any suspicion of heat problems:  Being mechanical, they're prone to
failure from seized bearings.

5.  PSUs

A marginal PSU (power supply unit) can give results that are both
far-reaching and puzzling.  It will tend to make the system behave
unpredictably, with no real pattern.  One way to spot this symptom is to
temporarily connect a known-good PSU and see if the system suddenly
becomes reliable.  Alternatively, temporarily disconnect power to all
non-essential components, which reduces amperage draw.  Note:  It is
extremely common for a PSU that catastrophically fails to destroy all
attached hard drives on its way out.

6.  CD/DVD Drives

The last of the most common problem components is the CD or DVD drive,
generally on account of dirt/dust gradually introduced into the drive
from dirty disks or careless unit storage, which gets grit onto the
lens, and ruins drive reliability.  

Primary Software Tools

1. Cerberus Test Control System

One primary tool for testing RAM is the Cerberus Test Control System,
formerly used at hardware manufacturer VA Linux Systems to stress-test
all new or repaired hardware for several days before shipment.  

Since it is complex and has some very non-intuitive aspects, I've 
written it up in a separate lecture, "Hardware Stress Testing".

2.  Knoppix

Knoppix is famous as a very easy to use, desktop-user-friendly Linux
distribution that by default boots and runs entirely from a CD or DVD
drive.  However, it includes a large number of tools useful for hardware
diagnosis.  Here are a few:

(a) "dmesg | less" is a command you run from Knoppix (or any Linux
system's) command console to see the system "console log", including all
boot-up messages.  It is frequently extremely revealing about what sort
of hardware is present and what problems are noted at boot time.

(b) memtest86 is a utility that runs as a standalone boot routine
rather than under Linux, i.e., you boot into it from a bootloader on a
CDR or via a PXE boot option, and it then runs as the sole process until
the end of testing, which involves a long series of writes to and reads
from memory in varied patterns, to try to trigger errors.  Unfortunately,
to be thorough, its testing takes time.  Therefore, to avoid the
possibility of an inconclusive test, you should generally pick the
"extensive test" option, which adds lengthy-running patterns 8, 9, and
10, and should run those tests for at least 24 hours on typical
machines.
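
To see why varied patterns matter, here is a toy Python sketch of the
write-then-read-back idea.  It is not how memtest86 is implemented (a
program running under an OS cannot test physical RAM this way); the
four patterns and the simulated "stuck bit" fault are assumptions made
up for the illustration.

```python
def pattern_test(memory, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Write each pattern to every cell, read it back, and record mismatches.

    `memory` is any mutable sequence of byte values (a stand-in for RAM).
    Returns a list of (address, expected, actual) tuples.
    """
    errors = []
    for pattern in patterns:
        for addr in range(len(memory)):
            memory[addr] = pattern
        for addr in range(len(memory)):
            actual = memory[addr]
            if actual != pattern:
                errors.append((addr, pattern, actual))
    return errors


class StuckBitMemory(bytearray):
    """Simulated faulty RAM: bit 0 of cell 3 always reads back as 1."""

    def __getitem__(self, index):
        value = super().__getitem__(index)
        if index == 3:
            value |= 0x01
        return value


print(pattern_test(bytearray(8)))       # healthy buffer: no mismatches
print(pattern_test(StuckBitMemory(8)))  # faulty buffer: mismatches at cell 3
```

Note that patterns 0xFF and 0x55 already have bit 0 set, so they pass
on the faulty cell; only 0x00 and 0xAA expose it.  A test that tried
too few patterns would have been inconclusive.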

The Cerberus Test Control System (above) includes a variant form of
memtest86 as one of its numerous simultaneous tests.

(c) lspci, lsusb, dmidecode, X11 log

These are all utilities that report on what hardware is present on your
machine, in various categories.  The X11 log will be written inside
/var/log, and the name depends on your Knoppix (or other Linux
distribution) version.

(d) /proc/cpuinfo, /proc/meminfo

These are informational pseudofiles with measures of system CPU
characteristics and RAM characteristics, respectively.  Notice the
"BogoMIPS" field in cpuinfo.  That is a measure of how quickly the
CPU(s) run the kernel's calibrated delay (busy-wait) loop, and is a
quick, rough measure of CPU health.  See
http://www.clifton.nl/index.html?bogomips.html for details of how high
the measure should be for each standard CPU class.  If your system's
measure is off, or changes unexpectedly, that points to CPU problems.

The "flags" field in cpuinfo holds a number of useful items, including
the "lm" flag, which will be present if and only if the machine is
x86_64-capable.  
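
Those two cpuinfo checks can be sketched in Python as below.  The
parser assumes the "key : value" layout that cpuinfo uses on x86
machines (other architectures name their fields differently), and the
sample text is invented rather than copied from a real machine.

```python
def parse_cpuinfo(text):
    """Parse the first processor block of cpuinfo-style text into a dict.

    Keys are lower-cased, since the BogoMIPS field's capitalisation varies.
    """
    info = {}
    for line in text.splitlines():
        if not line.strip():
            if info:
                break  # a blank line ends the first processor's block
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip().lower()] = value.strip()
    return info


def is_x86_64_capable(info):
    """True if the 'lm' (long mode) flag appears in the flags field."""
    return "lm" in info.get("flags", "").split()


def bogomips(info):
    """Return the BogoMIPS figure as a float, or None if the field is absent."""
    value = info.get("bogomips")
    return float(value) if value is not None else None


# Invented sample; on a live system you would read /proc/cpuinfo instead.
sample = "processor\t: 0\nflags\t\t: fpu vme lm sse2\nbogomips\t: 4000.00\n"
info = parse_cpuinfo(sample)
print(is_x86_64_capable(info), bogomips(info))
```

On a running Linux system, something like
parse_cpuinfo(open("/proc/cpuinfo").read()) would feed it real data.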

(e) badblocks.  You can run this utility to write/read test your hard
drive(s).  Again, comprehensive testing takes many hours or overnight.
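
Because badblocks can either preserve or destroy the drive's contents
depending on its mode, it helps to be explicit about which one you
mean.  The Python sketch below only builds the command line; the helper
name and mode labels are made up for this example, and /dev/sdX is a
placeholder, not a real device.  The flags are those in the badblocks
man page: -s (show progress), -v (verbose), -n (non-destructive
read-write), -w (destructive write test).

```python
def badblocks_command(device, mode="read-only"):
    """Build an argument list for badblocks.

    mode: "read-only" (badblocks' default behaviour),
          "non-destructive" (-n, read-write test that preserves data), or
          "destructive" (-w, write test that ERASES the device).
    """
    mode_flags = {
        "read-only": [],
        "non-destructive": ["-n"],
        "destructive": ["-w"],
    }
    if mode not in mode_flags:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["badblocks", "-s", "-v", *mode_flags[mode], device]


# /dev/sdX is a placeholder; substitute the (unmounted) drive under test.
print(" ".join(badblocks_command("/dev/sdX", mode="non-destructive")))
```

The resulting list could be handed to subprocess.run() as root.  Check
the device name twice before using the destructive mode.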

3. An MS-DOS CD-ROM.  If you don't already have one, you can download a
usable image file via http://www.bootdisk.com/ .  These are frequently
needed to run firmware-updating routines, manufacturer hardware-testing
routines, etc.

I maintain a list of all known manufacturers' hard drive testing routines 
on my Web site at http://linuxmafia.com/faq/Hardware/hdutils.html .  Those
are (probably still) 100% DOS utilities.

4.  If a machine is behaving strangely, consider going into the system
BIOS Setup and examining all settings.  Consider, as well, resetting the
system BIOS to factory defaults, which will always be offered as an
on-screen option before you reboot (optionally saving changes), as the
factory settings will tend to be generally "sane" and result in
conservative operation.

5.  I mention the Linux smartmontools mostly to suggest that they're the
wrong tool at the wrong time, if you're encountering a system for
diagnosis as opposed to starting to use it and only eventually
diagnosing it.  That is, this suite is designed to monitor a system's
hard drive self-reporting routines and track them over a long period of
time, using trend analysis to predict near-future failure after the
pattern is established.  Therefore, in a situation where you suddenly
have to figure out what's wrong, and there aren't existing smartmontools
records on file, it's the wrong tool.  However, running it on a healthy
Linux system is a very good idea.
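
The point about needing history can be illustrated with a toy trend
check in Python.  This is not how smartmontools works internally; the
function name, the three-sample minimum, and the sample readings are
all assumptions for the sketch.

```python
def is_worsening(readings, min_samples=3):
    """Judge whether a drive attribute is trending the wrong way.

    readings: chronological (label, value) pairs for one attribute where
    a rising value is bad, e.g., a count of reallocated sectors.
    Returns None if there are too few samples to establish a pattern.
    """
    values = [value for _, value in readings]
    if len(values) < min_samples:
        return None  # no established history: the wrong tool at this time
    never_decreases = all(b >= a for a, b in zip(values, values[1:]))
    return never_decreases and values[-1] > values[0]


# Invented readings, for illustration only.
print(is_worsening([("week 1", 0)]))
print(is_worsening([("week 1", 0), ("week 2", 2), ("week 3", 5)]))
print(is_worsening([("week 1", 3), ("week 2", 3), ("week 3", 3)]))
```

With a single reading there is nothing to compare against, which is the
situation you are in when handed an unfamiliar, already-misbehaving
machine.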

Primary Hardware Tools

1.  Reseating/replugging.  Often, hardware strangeness can be made to
suddenly go away long-term just by unplugging and reseating (while the
system is powered down) all major components such as add-in cards and
cables.  This is especially recommended if there is any suspicion that
the unit has been recently moved physically.

2.  Dust-Off

Compressed air cans with anti-static treatment are sold under a variety
of brand-names, of which Dust-Off is one.  Again, sometimes this is all
that's required to get rid of hardware flakiness, and you will also
extend hardware life generally by reducing the likelihood of dust, heat,
and bearing problems.

3.  Spare ATAPI CD-ROM Drive and PATA Cable

Some server and desktop units don't include a removable-disk drive, or
the drive is suspect, but all have places to connect one on the
motherboard.  The drives and ribbon cables are very cheap, so you should
have one, marked in pen as to when it was last determined to be
known-good.

4.  Spare PSU

Beware of the diversity of PSUs needed by today's more-powerful CPUs,
but it's still possible to keep some known-good units around, for use to
determine if the in-place PSU is weak.

5.  Other obvious tools:

Medium Phillips screwdriver, flat-head screwdriver, Torx driver, small
flashlight, needlenose pliers, cable ties, cyanoacrylate glue, USB flash
drive, writable CDRs, tin of spare screws.