Stress Testing Hardware

Lecture Notes/Outline by Rick Moen

Abstract / Why This Matters:

There are times when it makes sense to perform at least initial hardware diagnosis / triage; this lecture discusses in depth one leading tool for so doing.



Hardly anyone, these days, has diagnosis of computer hardware as part of his/her job, and it's vital to avoid spending time on such efforts except where they make business sense: When in doubt, think twice before spending significant amounts of time working through these techniques, to verify that they're a business priority.

For one thing, often computer hardware is under maintenance contracts, either pursuant to manufacturer / vendor warranty obligations or separate IT contracts for hardware repair. That having been said, there are always situations where time spent doing at least initial hardware diagnosis is justified anyway, e.g., to short-circuit finger-pointing between vendors or to overcome vendor reluctance to give satisfactory resolutions.

This lecture covers one general-purpose tool for testing computer hardware, the Cerberus Test Control System, which puts all parts of the system under heavy load for a test period you determine, running until you stop it.

Cerberus Test Control System Overview

Cerberus Test Control System (with Cerberus being pronounced "sir-burr-us") is a test harness of configuration files and scripts formerly used at hardware manufacturer VA Linux Systems to stress-test all new or repaired hardware for several days before shipment. The spelling with a C was to distinguish it from the authentication software Kerberos. The CTCS suite has long been available to the public under the GNU General Public License.

Cerberus, aka CTCS (Cerberus Test Control System), is both (1) a little rough-edged and (2) weird in its user interface. It works best with an installed and fairly full-featured set of Linux server packages on the HD. You should also have the distribution's own kernel source tree, unpacked somewhere. There must be a symlink /usr/src/linux pointing to that unpacked source tree. (This is necessary because one of Cerberus's many tests is an iterative kernel compile.)
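As a pre-flight check for the symlink requirement above, something like the following could be used. This is a minimal sketch: the function name check_kernel_tree and the "has a top-level Makefile" heuristic are assumptions made for illustration, not anything CTCS itself provides.

```shell
# Sketch: verify that a path is a symlink resolving to a directory that
# looks like an unpacked kernel source tree.
# ASSUMPTION: "looks like a kernel tree" is approximated by the presence
# of a top-level Makefile; CTCS does not ship this check.
check_kernel_tree() {
    link="${1:-/usr/src/linux}"
    if [ ! -L "$link" ]; then
        echo "$link is not a symlink" >&2
        return 1
    fi
    if [ ! -d "$link" ]; then          # -d follows the symlink
        echo "$link does not resolve to a directory" >&2
        return 1
    fi
    if [ ! -f "$link/Makefile" ]; then
        echo "$link has no top-level Makefile" >&2
        return 1
    fi
    echo "$link looks usable"
}
```

Called with no argument, it checks /usr/src/linux; a non-zero return status means the kernel-compile test is likely to report spurious errors.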

Given all of that, what you do is this:

cd to /tmp
Unpack the Cerberus tarball. cd into it.
"make". Wait for the compile to finish.
"./newburn". This starts the test run.

Cerberus will then run, bogging down the machine frightfully, and you'll see a rather minimal display on the console, showing little but time elapsed, plus colourful messages if/when something major has an error. When you want to stop the testing, e.g., to check results, type Ctrl-C and wait ~30 seconds, because Cerberus really does take that long to notice your keystroke. You'll exit back to the shell.

At this point, you'll want to cd into the logging directory. This will be of the form .newburn.tcf.log.NNNN, where NNNN is a number generated randomly the first time you do "./newburn".

You'll now see in that directory a logfile for each test. Easiest way to do a quick check is this: "grep FAILED *". If that returns nothing, then you have 100% clean results. If not, then you have to look more closely at the logfile that purported to yield non-clean results. Some interpretation is sometimes required to determine what really happened. E.g., I've seen people get basically spurious errors on the kernel-compile test because they used a slightly incompatible kernel source tree that is not the one provided by the distribution, or because they plugged and unplugged a USB flash drive (adding a phantom drive to those Cerberus will try to surface-test), or because they forgot to create a /usr/src/linux symlink.
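The quick check described above can be wrapped so that it names only the logfiles needing a closer look. A minimal sketch follows; the function name list_failed_logs is hypothetical, and the only string matched is the "FAILED" marker mentioned in these notes.

```shell
# Sketch: print the names of logfiles in a CTCS logging directory that
# contain the string "FAILED".
# Returns 1 if any such file exists, 0 if the results look clean.
# ASSUMPTION: the function name is invented for this example.
list_failed_logs() {
    logdir="$1"
    # -l: print only names of matching files
    # -s: stay quiet about unreadable or nonexistent files
    if grep -ls FAILED "$logdir"/*; then
        return 1
    fi
    echo "no FAILED entries in $logdir"
    return 0
}
```

From the Cerberus directory, this would be run as: list_failed_logs .newburn.tcf.log.NNNN (substituting the actual number). Any file it names still needs to be read and interpreted by hand, for the reasons given above.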

To restart the test where you left off, do "cd .." to exit the logfile directory and then do "./newburn" again. Note that the elapsed-time indicator resumes where it left off. That is because it's still logging to the same .newburn.tcf.log.NNNN directory as before.

To start a completely new test run, i.e., to start Cerberus with a .newburn.tcf.log.NNNN directory for some new value of NNNN, do "./burnreset" before the next "./newburn".

The most common question I get is "Do I have to have a Linux distribution installed? Can I use it to test my Windows machines if I boot from a Knoppix disk?" The answer is: Sort of, but not easily. Cerberus is designed to test and "burn in" Linux systems, having been originally developed for testing machines manufactured by VA Linux Systems, Inc., as part of the factory QA process, preparatory to sale.

The main obstacle to running Cerberus from Knoppix or similar is that it expects /usr/src/linux to exist and point to a full kernel tree. Knoppix doesn't provide this, and /usr/src is on the CD itself when you boot, i.e., you cannot edit its contents. (The filesystem is not writeable.)

However, you can edit, with some work, the tests that comprise Cerberus. (Cerberus is controlled entirely by interpreted scripts and editable configuration files.) Find and change the references to /usr/src/linux to point (instead) to somewhere you can write to on a Knoppix system, e.g., /tmp/kerneltree (and, of course, make sure Cerberus finds a suitable full kernel tree there). Also, you may find that you need to disable the portions of Cerberus that test the hard disk (using "iozone"), as they expect to see Linux-native, writeable filesystems not MS-stuff like "FAT", "NTFS", etc.
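One way to find and change those references is a recursive grep followed by an in-place sed. This is only a sketch under stated assumptions: the function name retarget_kernel_path is invented here, it assumes a plain textual substitution is enough for every script that mentions the path, and it assumes the new path contains no "|" or "&" characters (which sed would treat specially).

```shell
# Sketch: rewrite every textual reference to /usr/src/linux under a
# directory so that it points somewhere else, keeping .bak copies.
# ASSUMPTIONS: function name is hypothetical; a plain string substitution
# is sufficient; new_path contains no '|' or '&'.
retarget_kernel_path() {
    ctcs_dir="$1"
    new_path="$2"
    # -r: recurse; -l: list file names only; -I: skip binary files
    grep -rlI '/usr/src/linux' "$ctcs_dir" | while IFS= read -r f; do
        # -i.bak edits in place and leaves the original as FILE.bak
        sed -i.bak "s|/usr/src/linux|$new_path|g" "$f"
        echo "edited: $f"
    done
}
```

For example: retarget_kernel_path . /tmp/kerneltree, run from the unpacked Cerberus directory. Review the "edited:" list before trusting the result, and note that a second run would also match the .bak copies left by the first.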

The other question I get is: "Is Cerberus safe to use on a production machine, i.e., is there any chance it will hurt my files or partitions?" If you read the Cerberus docs, the developers warn that they're making absolutely no guarantees and neither will I. However, I've used it on many production systems, without harm. One of the disk-testing routines does sometimes leave one or two garbage files behind in the root directory of each filesystem if you exit Cerberus abnormally, e.g., by hard-rebooting. Otherwise, I find that it creates only logfiles, and only underneath the runtime directory in which you started it.

And it's important to know that you must be the root user to successfully run Cerberus. (E.g., on Knoppix, do "sudo bash" first.)
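A small guard can catch the not-root case before a run is started. A minimal sketch, with a hypothetical function name; the optional argument exists only so the check can be exercised without actually being root.

```shell
# Sketch: succeed only when the effective user ID is 0 (root).
# ASSUMPTION: require_root is a name invented for this example.
# An explicit UID may be passed as $1; otherwise "id -u" is consulted.
require_root() {
    uid="${1:-$(id -u)}"
    if [ "$uid" -ne 0 ]; then
        echo "must be root to run Cerberus (current uid is $uid)" >&2
        return 1
    fi
    return 0
}
```

Typical use would be: require_root && ./newburn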

The numerous tests in CTCS are mostly not aimed at any particular subsystem, but do point to some subsystems more than others if you study the logged results attentively. For example, the "kernel" test -- which does multiple, parallelised kernel compiles using gcc -- while not specifically a CPU, memory, or motherboard component test, tends to disclose problems in hardware that currently no "memory test" or "CPU test" can detect.

Standard Tests

1. Jason Collins's improved memory test (rewrite based on Larry Augustin's original memory tester)

Module: memtest
Burn-in identifier:  MEMORY0, MEMORY1, etc.

This is a vicious little memory system thrasher that not only exercises hardware but also the Linux paging system. Only the best memory can survive this. Unfortunately, since it is a user-space application, it can't touch all memory, but CTCS will attempt to configure it appropriately for you.

2. Kernel compilation loop

Module: kernel
Burn-in identifier:  FKCOMPILE, KCOMPILE

It's well known that the GNU C compiler can reveal problems with memory, CPU, and cache RAM. The kernel compilation loop repeatedly compiles the Linux kernel, looking for GCC signal 11 or other errors.
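To illustrate what "looking for GCC signal 11 or other errors" can amount to when reading a compile log by hand, here is a rough sketch. Signal 11 is SIGSEGV on Linux. The function name and the three patterns are assumptions chosen for illustration; they are not the matching rules the CTCS kernel module actually uses.

```shell
# Sketch: show numbered lines of a build log that suggest the compiler
# itself crashed, which on known-good source often points at hardware.
# ASSUMPTIONS: scan_build_log and the pattern list are illustrative only,
# not taken from CTCS.
scan_build_log() {
    # -n: prefix line numbers; -i: ignore case; -E: extended regex
    grep -n -i -E 'signal 11|internal compiler error|segmentation fault' "$1"
}
```

The exit status is grep's: 0 when at least one suspicious line was found, 1 when none was. A crash that moves to a different source file on each pass is the classic sign of flaky memory, CPU, or cache rather than a genuine compiler bug.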

3. Disk (block device) read tests

Module: sblockrdtest
Burn-in identifier:  BBhdaN0, BBhdaN1, BBsdcN4, etc

CTCS's newburn will automatically launch 8 staggered linear reads for each hard disk to induce rapid seeking. If a sector can't be read, these tests will break. BB stands for "badblocks", the program used to execute the test. hda, sdc, etc. are block devices; N? indicates the number in the group of 8 staggered reads.

4. System Log monitors

Module: dmesg
Module: messages
Burn-in identifier:  DMESG, SYSLOG

The Linux kernel is capable of detecting a large number of error conditions that are outside the realm of user applications such as Cerberus. CTCS will launch log monitors to watch for oopses or other miscellaneous errors while the rest of the system is running.
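The same kind of check can be done by hand against kernel messages. The following is a sketch only: the function name and the patterns are assumptions picked as common examples of kernel error text, and they do not reproduce what CTCS's dmesg and messages modules look for.

```shell
# Sketch: filter kernel log text on standard input, keeping lines that
# look like serious errors.
# ASSUMPTIONS: scan_kernel_log and its pattern list are illustrative,
# not CTCS's own.
scan_kernel_log() {
    # -i: ignore case; -E: extended regex; reads standard input
    grep -i -E 'oops|kernel BUG|I/O error|machine check'
}
```

Usage would be along the lines of: dmesg | scan_kernel_log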

5. DAC960 driver monitor

Module: dac960
Burn-in identifier:  DAC960C0, DAC960C1, etc.

The dac960 driver monitor will watch the current status file for each DAC960 controller (/proc/rd/c[controller number]/current_status) and will report any errors. This test won't be run if you don't have any controllers handled by the dac960 driver.

(This last test will run only on machines with Mylex SCSI RAID cards, which are now rare.)

Optional add-ons

There are wrapper scripts in CTCS for the following optional test components, which can be fetched separately. (See the help on the command-line options for "newburn" and the documentation in CTCS's README.tests.)

1. Linux Test Project (SGI)

This project is similar in structure to CTCS, but they have a lot more programmers. The module API is mostly compatible with CTCS, but CTCS's driver scripts are more intelligent. They also have more modules, and they're more focused on software testing than hardware.

2. BYTE Benchmarking Suite (Linux Port)

This is a good port of the generic benchmark utility for Linux. You can use it for exercising basic floating/integer/memory/IO operations.

3. UCSC SmartSuite

Monitor your hard drives for failures, and predict them before they happen. Very nice, if your drives support it.


Portions of these notes enumerating CTCS's standard and optional feature sets were taken from the VA-CTCS FAQ in the project source repository and modified for presentation here.