Lecture Notes/Outline by Rick Moen
There are times when it makes sense to perform at least initial hardware diagnosis / triage; this lecture suggests tools and techniques for so doing.
Hardly anyone, these days, has diagnosis of computer hardware as part of his/her job, and it's vital to avoid spending time on such efforts except where they make business sense. When in doubt, think twice before spending significant amounts of time working through these techniques, and verify that they're a priority.
For one thing, often computer hardware is under maintenance contracts, either pursuant to manufacturer / vendor warranty obligations or separate IT contracts for hardware repair. That having been said, there are always situations where time spent doing at least initial hardware diagnosis is justified anyway, e.g., to short-circuit finger-pointing between vendors or to overcome vendor reluctance to give satisfactory resolutions.
A former boss of mine in the "Management Information Systems" (IT) department at now-vanished database firm Blyth Software, Mr. David Carroll, had a characteristic expression that he would say, to nobody in particular, whenever we were trying to solve a technical problem: "What do we know?" At first, I mostly classed it as a personal quirk, but eventually came to understand that it was an extremely helpful tool for shortening diagnostic time: You think: What is the known history of this problem, what are the variables, what assumptions am I currently making, what are the suspects, what are the known-good elements? You repeat that discipline at intervals, to make sure you're not haring off on wild goose-chases or wasting time doing something doomed to prove ultimately inconclusive.
His other frequent observation on the matter was that "Any diagnostic theory, if pursued logically enough, will eventually arrive at the right answer." The trick, of course, was to both be sufficiently logical and pick an efficient diagnostic path (series of hypotheses).
All of the other techniques I will outline below aim to apply those two concepts. (This section is of necessity purely conceptual; I'll fill in specifics later.)
Carroll's caution about pursuing a theory logically implies carefully avoiding inconclusive procedures: If your answer to the "What do we know?" question is the same before and after a procedure, then it's wasted time. It's important to remember this, e.g., if onlookers ask you "Why aren't you doing [procedure foo]?", where [foo] has no information value.
Suggestions from users and onlookers are very common and almost never useful, since most people don't understand diagnosis and haven't bothered to acquaint themselves with background facts of the case. Be prepared to summon up some polite equivalent of "Because I said so; now, please let me work" for such situations.
Often, especially when the problem is one you yourself didn't discover, the first and most crucial step is replication, i.e., observing the problem symptoms for yourself. It's a truism that technical problems that cannot be reproduced also cannot be fixed, and technical support departments the world over tend to have some problem-resolution status analogous to the infamous "CNR" = Could Not Reproduce, for problems that cannot even be observed, let alone fixed.
In that area, diagnosticians make a tongue-in-cheek distinction between "Heisenbugs", i.e., bugs that are difficult to reproduce because they behave in a Werner Heisenberg-like indeterminate fashion, refusing to appear on demand, and "Bohr bugs", those that are reliably reproducible (like Niels Bohr's competing static model of the atom).
When you are starting a diagnostic track, you are trying to narrow down which of various possible causes is producing the symptom. Towards that end, it's very useful to know, or to determine, which suspect components are already known to be non-defective, called "known-good". In general, the process of diagnosis entails enumerating (in your head) all the things that could cause the problem, and narrowing down the suspects to the one that must be responsible.
Writer Damon Runyon once wrote that "The race is not to the swift, nor the battle to the strong, but that's the way to bet." Similarly, successful diagnosis requires a good seat-of-the-pants notion of probability: Diagnosticians tend from experience to strongly resist any conclusion involving coincidence, e.g., that a failure was caused by two parts failing at the same time, unless there is a plausible candidate cause underlying all parts of the "coincidence", e.g., you decide that two sticks of RAM are both defective, because they both turn out to have been pulled from a pile of untested suspect parts.
Another important conceptual trait I'll mention is predictiveness, or "If so, then...." If you're going on a hunch that your unit's misbehaving because of heat buildup, then you should stop and think about how such a unit would and would not behave. Does the problem manifest only starting half an hour after a cold startup and never before?
Your senses of hearing and (to a lesser extent) smell are quite important, and mustn't be forgotten: If you've gotten to know the background sounds of a system when it's starting up, e.g., the initialisation and seeking of the various hard drives, it really stands out when those sounds are in part missing, or different. Can you hear the various fans starting up? The hard drives sequentially spinning up and seeking their respective sector zeros? Smell is also significant in the sense of helping you detect incipient overheating before it causes disaster, and also noticing the moment of an electronic component's death through electrocution or overheating: The acrid smell a semiconductor emits when it dies in that fashion is unforgettable, once you've encountered it the first time.
1. Loose Connections
An astonishingly high percentage of hardware problems, maybe as high as half, turn out to be curable by reseating major assemblies (RAM sticks, cables, add-in cards). You should be especially quick to check this if there's any reason to suspect the unit has been recently moved or jostled.
2. Bad RAM
Bad memory chips are increasingly a problem in computers, because of pressure on RAM prices, higher densities, and relatively worse fault-detection in most OSes. Consequences of bad RAM only begin with unreliable operation: At worst, you can have gradual silent corruption of stored data every time it passes through the defective RAM segment, with the result that, once the data corruption is found, management must choose between very old data on backup sets and more-recent but gradually more corrupted data.
Most "memory-checking" routines in the Power-On Self Test (POST) routines of computer boot-time BIOS Setup programs are worse than useless -- worse, because they give false assurance of quality while routinely giving their OK to grossly defective RAM. The exception is motherboard-based checking of ECC RAM, where ECC RAM is actually present and where the full checking routines are enabled in the BIOS Setup (which is often not the default).
Signs of bad RAM include: segmentation-faulting processes, spontaneous reboots, kernel panic messages entered into systems logfiles (which may include "NMI: dazed and confused but struggling to continue"), mysteriously dying processes. Of course, there may be other causes of these symptoms, such as NFS problems.
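As a minimal sketch of hunting for those symptoms, you can grep the kernel's log output for the telltale strings. The log excerpt below is fabricated for illustration; on a real system you would scan /var/log/syslog (or equivalent) or the output of dmesg instead.

```shell
# Hypothetical kernel-log excerpt, for illustration only.
cat > /tmp/sample-kern.log <<'EOF'
kernel: NMI: dazed and confused, but trying to continue
kernel: myapp[1234]: segfault at 0000000000000000 rip 00000000004005d6
kernel: eth0: link up, 1000Mbps, full-duplex
EOF

# Pull out lines suggestive of RAM (or other hardware) trouble.
grep -iE 'nmi|segfault|panic|machine check' /tmp/sample-kern.log
```

Remember, as noted above, that a hit here is only a suspect, not a conviction: segfaults and oopses have plenty of other possible causes.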
Sometimes, "shy" memory problem symptoms can be brought to the fore by reshuffling the order of the RAM sticks in their sockets, or by testing sticks in the minimum quantity at a time. (Some machines need RAM deployed in pairs, because of interleaving.)
3. Hard Drives
Hard drives are the next most common cause of problems, but failures are generally quite obvious. You will either get inability to access the drive at boot, or sector seek errors, or "sense key" errors.
4. Heat Buildup
Heat buildup is the next most common problem, and one of the most pernicious, because it weakens all components in the affected machine and shortens all of their lifetimes. Note that once one hard drive in a system fails because of system overheating, any other hard drives in it are at very high risk. Be certain to check all internal fans, if you have any suspicion of heat problems: Being mechanical, they're prone to failure from seized bearings.
5. Marginal PSUs

A marginal PSU (power supply unit) can give results that are both far-reaching and puzzling. It will tend to make the system behave unpredictably, with no real pattern. One way to spot this symptom is to temporarily connect a known-good PSU and see if the system suddenly becomes reliable. Alternatively, temporarily disconnect power to all non-essential components, which reduces amperage draw. Note: It is extremely common for a PSU that catastrophically fails to destroy all attached hard drives on its way out.
6. CD/DVD Drives
The last of the most common problem components is the CD or DVD drive, generally on account of dirt/dust gradually introduced into the drive from dirty disks or careless unit storage, which deposits grit on the lens and ruins drive reliability.
1. Cerberus Test Control System
One primary tool for testing RAM is the Cerberus Test Control System, formerly used at hardware manufacturer VA Linux Systems to stress-test all new or repaired hardware for several days before shipment.
Since it is complex and has some very non-intuitive aspects, I've written it up in a separate lecture, "Hardware Stress Testing".
2. Knoppix

Knoppix is famous as a very easy to use, desktop-user-friendly Linux distribution that by default boots and runs entirely from a CD or DVD drive. However, it includes a large number of tools useful for hardware diagnosis. Here are a few:
(a) "dmesg | less" is a command you run from Knoppix (or any Linux system's) command console to see the system "console log", including all boot-up messages. It is frequently extremely revealing about what sort of hardware is present and what problems are noted at boot time.
(b) memtest86 is a Linux utility that runs as a standalone boot routine, i.e., you boot into it from a bootloader on a CDR or a PXEboot option, and it then runs as the sole process until the end of testing, which involves a long series of writes to and reads from memory in varied patterns, to try to trigger errors. Unfortunately, to be thorough, its testing takes time. Therefore, to avoid the possibility of an inconclusive test, you should generally pick the "extensive test" option, which adds lengthy-running patterns 8, 9, and 10, and should run those tests for at least 24 hours on typical machines.
The Cerberus Test Control System (above) includes a variant form of memtest86 as one of its numerous simultaneous tests.
(c) lspci, lsusb, dmidecode, X11 log
These are all utilities that report on what hardware is present on your machine, in various categories. The X11 log will be written inside /var/log, and the name depends on your Knoppix (or other Linux distribution) version.
(d) /proc/cpuinfo, /proc/meminfo
These are informational pseudofiles with measures of system CPU characteristics and RAM characteristics, respectively. Notice the "BogoMIPS" field in cpuinfo. That is a measure of how quickly the CPU(s) handle the idle loop, and it serves as a quick measure of CPU health. See http://www.clifton.nl/index.html?bogomips.html for details of how high the measure should be for each standard CPU class. If your system's measure is off, or changes unexpectedly, that points to CPU problems.
The "flags" field in cpuinfo includes a number of useful data, including the "lm" flag, which will be present if and only if the machine is x86_64-capable.
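The checks above can be sketched in a couple of lines of shell. The cpuinfo fragment below is made up for the example (field values will differ on any real machine); on a live system you would read /proc/cpuinfo itself.

```shell
# Illustrative /proc/cpuinfo fragment; on a live system, read the real file.
cat > /tmp/sample-cpuinfo <<'EOF'
model name : Intel(R) Pentium(R) 4 CPU 3.00GHz
bogomips   : 5999.77
flags      : fpu vme de pse tsc msr pae mce lm
EOF

# Is the CPU x86_64-capable?  The "lm" (long mode) flag says yes.
if grep -qw lm /tmp/sample-cpuinfo; then
    echo "64-bit capable"
else
    echo "32-bit only"
fi

# Extract the BogoMIPS figure as a quick sanity check on CPU health.
awk -F: '/bogomips/ {print $2}' /tmp/sample-cpuinfo
```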
(e) badblocks. You can run this utility to write/read test your hard drive(s). Again, comprehensive testing takes many hours or overnight.
3. An MS-DOS CD-ROM. If you don't already have one, you can download a usable image file via http://www.bootdisk.com/ . These are frequently needed to run firmware-updating routines, manufacturer hardware-testing routines, etc.
I maintain a list of all known manufacturers' hard drive testing routines on my Web site at http://linuxmafia.com/faq/Hardware/hdutils.html . Those are (probably still) 100% DOS utilities.
4. If a machine is behaving strangely, consider going into the system BIOS Setup and examining all settings. Consider, as well, resetting the system BIOS to factory defaults, an option the Setup program always offers on its exit screen (along with saving or discarding changes), as the factory settings tend to be generally "sane" and result in conservative operation.
5. I mention the Linux smartmontools mostly to suggest that they're the wrong tool at the wrong time, if you're encountering a system for diagnosis as opposed to starting to use it and only eventually diagnosing it. That is, this suite is designed to monitor a system's hard drive self-reporting routines and track them over a long period of time, using trend analysis to predict near-future failure after the pattern is established. Therefore, in a situation where you suddenly have to figure out what's wrong, and there aren't existing smartmontools records on file, it's the wrong tool. However, running it on a healthy Linux system is a very good idea.
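That said, even a one-shot look at a drive's SMART attributes can be suggestive. Here's a minimal sketch of picking out the attributes most worth worrying about; the smartctl-style output below is fabricated for the example (on a real system you'd parse the actual output of "smartctl -A /dev/sda" or similar).

```shell
# Illustrative fragment of smartctl attribute output; values are made up.
cat > /tmp/sample-smart.txt <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4281
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
EOF

# Flag attributes whose raw (last-column) values suggest a failing drive:
# any nonzero reallocated or pending sector count is worth watching.
awk '/Reallocated_Sector_Ct|Current_Pending_Sector/ && $NF > 0 \
     {print $2, "raw value:", $NF}' /tmp/sample-smart.txt
```

A single nonzero reading like this isn't a trend, which is exactly the point made above; but it's a reasonable prompt to start collecting smartmontools records going forward.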
1. Reseating/replugging. Often, hardware strangeness can be made to suddenly go away long-term just by unplugging and reseating (while the system is powered down) all major components such as add-in cards and cables. This is especially recommended if there is any suspicion that the unit has been recently moved physically.
2. Compressed Air

Compressed air cans with anti-static treatment are sold under a variety of brand-names, of which Dust-Off is one. Again, sometimes this is all that's required to get rid of hardware flakiness, and you will also extend hardware life generally by reducing the likelihood of dust, heat, and bearing problems.
3. Spare ATAPI CD-ROM Drive and PATA Cable
Some server and desktop units don't include a removable-disk drive, or the drive is suspect, but all have places to connect one on the motherboard. The drives and ribbon cables are very cheap, so you should have one, marked in pen as to when it was last determined to be known-good.
4. Spare PSU
Beware of the diversity of PSUs needed by today's more-powerful CPUs, but it's still possible to keep some known-good units around, for use to determine if the in-place PSU is weak.
5. Other obvious tools:
Medium Phillips screwdriver, flat-head screwdriver, Torx driver, small flashlight, needlenose pliers, cable ties, cyanoacrylate glue, USB flash drive, writeable CDRs, tin of spare screws.