[sf-lug] Stress Testing (was: memory integrity checking / excluding bad memory)

Sat Feb 2 22:13:28 PST 2008

And here's my other (possibly relevant) set of lecture notes from days of
yore:

Stress-Testing Hardware
Lecture Notes/Outline by Rick Moen

Abstract / Why This Matters:  There are times when it makes sense to
perform at least initial hardware diagnosis / triage; this lecture
discusses in depth one leading tool for so doing.

Table of Contents

Introduction
Cerberus Test Control System Overview
Standard Tests
  Memory Test
  Kernel Compile Loop
  Disk (Block Device) Read Tests
  System Log Monitor
  DAC960 Driver Monitor
Optional Tests
  Linux Test Project
  BYTE Benchmarking Suite
  UCSC SmartSuite
Acknowledgements

Introduction

Hardly anyone has diagnosis of computer hardware as part of his/her job,
and it's vital to avoid spending time on such efforts except where they
make business sense:  When in doubt, please consider carefully, before
spending significant amounts of time working through these techniques,
whether they're worth the time and effort they will take away from your
other productive tasks.

For one thing, much computer hardware is under maintenance contracts,
either pursuant to manufacturer / vendor warranty obligations or
separate IT contracts, e.g., for hardware repair.  That having been
said, there are always situations where user time spent doing at least
initial hardware diagnosis is justified anyway, e.g., to short-circuit
finger-pointing between vendors or to overcome vendor reluctance to give
us satisfactory resolutions.

This lecture covers one general-purpose tool for testing computer
hardware, the Cerberus Test Control System, which puts all parts of 
the system under heavy load for a test period you determine, running
until you stop it.

Cerberus Test Control System Overview

Cerberus Test Control System (with Cerberus being pronounced
"sir-burr-us") is a test harness of configuration files and scripts
formerly used at hardware manufacturer VA Linux Systems to stress-test
all new or repaired hardware for several days before shipment.  The
spelling with a C was to distinguish it from the authentication software
Kerberos.  The CTCS suite has long been available to the public under the
GNU General Public License.

Cerberus aka CTCS (Cerberus Test Control System,
http://sourceforge.net/projects/va-ctcs/) is both (1) a little
rough-edged and (2) weird in its user interface. It works best with an
installed and fairly full-featured set of Linux server packages on the
HD. You should also have the distribution's own kernel source tree,
unpacked somewhere.  There must be a symlink /usr/src/linux pointing to
that unpacked source tree.  (This is necessary because one of Cerberus's
many tests is an iterative kernel compile.)

Given all of that, what you do is this:

    * cd to /tmp
    * Unpack the Cerberus tarball. cd into it.
    * "make". Wait for the compile to finish.
    * "./newburn" 
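
The steps above can be sketched as a small shell helper.  This is only
an illustration under stated assumptions:  the function names
(check_kernel_tree, start_burn) and both arguments to start_burn are
hypothetical, and the tarball is assumed to be gzip-compressed.

```shell
#!/bin/sh
# Sketch only: function names and arguments are illustrative, not part of CTCS.

# Succeeds if $1 (default /usr/src/linux) resolves to a directory.
# A symlink pointing at an unpacked kernel tree passes; a dangling
# symlink or a missing path fails.
check_kernel_tree() {
    link=${1:-/usr/src/linux}
    [ -d "$link" ]
}

# Unpack, build, and start a burn-in run from /tmp.
# $1 = path to the CTCS tarball (assumed gzip-compressed)
# $2 = name of the directory the tarball unpacks to
start_burn() {
    check_kernel_tree /usr/src/linux || {
        echo "/usr/src/linux does not point to a kernel tree" >&2
        return 1
    }
    cd /tmp &&
    tar xzf "$1" &&
    cd "$2" &&
    make &&
    ./newburn
}
```

Checking the kernel tree before starting avoids the spurious
kernel-compile failures that a missing symlink would otherwise cause.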

Cerberus will then run, bogging down the machine frightfully, and you'll
see a rather minimal display on the console showing time elapsed, except
for colourful messages if/when something major has an error. When you
want to stop the testing, e.g., to check results, type Ctrl-C and wait
~30 seconds, because Cerberus really does take that long to notice your
keystroke. You'll exit back to the shell.

At this point, you'll want to cd into the logging directory. This will
be of the form .newburn.tcf.log.NNNN, where NNNN is a number generated
randomly the first time you do "./newburn".

You'll now see in that directory a logfile for each test. The easiest way to
do a quick check is this: "grep FAILED *". If that prints nothing, then you
have 100% clean results. If not, then you have to look more closely at
the logfile that purported to yield non-clean results. Some
interpretation is sometimes required to determine what really happened.
E.g., I've seen people get basically spurious errors on the
kernel-compile test because they used a slightly incompatible kernel
source tree that is not the one provided by the distribution, or because
they plugged and unplugged a USB flash drive (adding a phantom drive to
those Cerberus will try to surface-test), or because they forgot to
create a /usr/src/linux symlink.
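
The quick check can be wrapped so that it also names the logfiles
worth a closer look.  A minimal sketch:  failed_logs is a hypothetical
helper name, and the only thing it assumes about the logs is what is
stated above, namely that a failing test leaves the string FAILED in
its logfile.

```shell
#!/bin/sh
# Print the name of each logfile in directory $1 that contains FAILED.
# Returns 0 when nothing matches, 1 when at least one logfile matches.
# Note: an empty directory also returns 0, so make sure logs exist.
failed_logs() {
    logdir=$1
    matches=$(grep -l FAILED "$logdir"/* 2>/dev/null)
    if [ -n "$matches" ]; then
        printf '%s\n' "$matches"
        return 1
    fi
    return 0
}
```

Usage would be along the lines of "failed_logs .newburn.tcf.log.NNNN",
with NNNN replaced by the number of your own logging directory.  Each
file it names still needs reading by hand, for the reasons given above.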

To restart the test where you left off, do "cd .." to exit the logfile
directory and then do "./newburn" again. Note that the elapsed-time
indicator resumes where it left off. That is because it's still logging
to the same .newburn.tcf.log.NNNN directory as before.

To start a completely new test run, i.e., to start Cerberus with a
.newburn.tcf.log.NNNN directory for some new value of NNNN, do
"./burnreset" before the next "./newburn".

The most common question I get is "Do I have to have a Linux
distribution installed? Can I use it to test my Windows machines if I
boot from a Knoppix disk?" The answer is: Sort of, but not easily.
Cerberus is designed to test and "burn in" Linux systems, having been
originally developed for testing machines being manufactured by VA Linux
Systems, Inc., as part of the factory QA process, preparatory to sale.

The main obstacle to running Cerberus from Knoppix or similar is that it
expects /usr/src/linux to exist and point to a full kernel tree. Knoppix
doesn't provide this, and /usr/src is on the CD itself when you boot,
i.e., you cannot edit its contents. (The filesystem is not writable.)

However, you can edit, with some work, the tests that comprise Cerberus.
(Cerberus is controlled entirely by interpreted scripts and editable
configuration files.) Find and change the references to /usr/src/linux
to point (instead) to somewhere you can write to on a Knoppix system,
e.g., /tmp/kerneltree (and, of course, make sure Cerberus finds a
suitable full kernel tree there). Also, you may find that you need to
disable the portions of Cerberus that test the hard disk (using
"iozone"), as they expect to see Linux-native, writable filesystems,
not MS-stuff like "FAT", "NTFS", etc.
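
One way to do the find-and-change step is sketched below.  The helper
name retarget_kernel_tree is hypothetical, and the sketch assumes the
replacement path contains no "|" or "&" characters (both are special
to the sed expression used).  It makes no assumption about which CTCS
files contain the references; it simply searches the whole directory.

```shell
#!/bin/sh
# Rewrite every reference to /usr/src/linux in the text files under
# directory $1 so that it points at $2 instead (e.g. /tmp/kerneltree).
# Caveat: longer paths that merely begin with /usr/src/linux are
# rewritten too.
retarget_kernel_tree() {
    srcdir=$1
    newtree=$2
    # -r: recurse, -l: list matching files, -I: skip binary files
    grep -rlI '/usr/src/linux' "$srcdir" | while IFS= read -r f; do
        tmp=$(mktemp)
        sed "s|/usr/src/linux|$newtree|g" "$f" > "$tmp" &&
        cat "$tmp" > "$f"    # overwrite contents so file modes are kept
        rm -f "$tmp"
    done
}
```

For example, "retarget_kernel_tree . /tmp/kerneltree" run from the
unpacked Cerberus directory.  Running "grep -r /usr/src/linux ."
first shows what would be changed.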

The other question I get is: "Is Cerberus safe to use on a production
machine, i.e., is there any chance it will hurt my files or partitions?"
If you read the Cerberus docs, the developers warn that they're making
absolutely no guarantees, and neither will I. However, I've used it on
many production systems, without harm. One of the disk-testing routines
does sometimes leave one or two garbage files behind in the root
directory of each filesystem if you exit Cerberus abnormally, e.g., by
hard-rebooting. Otherwise, I find that it creates only logfiles, and
only underneath the runtime directory in which you started it.

And it's important to know that you must be the root user to
successfully run Cerberus. (E.g., on Knoppix, do "sudo bash" first.)
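
A wrapper script can refuse to start unless it is running as root.  A
small sketch, with require_root as a hypothetical name; the optional
argument exists only so that the check can be exercised without
actually being root.

```shell
#!/bin/sh
# Succeeds only when the given uid (default: the current user's) is 0.
require_root() {
    uid=${1:-$(id -u)}
    if [ "$uid" -ne 0 ]; then
        echo "Cerberus must be run as root (current uid: $uid)" >&2
        return 1
    fi
}
```

Typical use would be "require_root && ./newburn".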

The numerous tests in CTCS are mostly not aimed at any particular
subsystem, but do point to some subsystems more than others if you study
the logged results attentively.  For example, the "kernel" test -- which
does multiple, parallelised kernel compiles using gcc -- while not
specifically a CPU, memory, or motherboard component test, will tend
to disclose problems in hardware that currently no "memory test" or "CPU
test" can detect.

Standard Tests

1. Jason Collins's improved memory test (rewrite based on Larry Augustin's 
original memory tester)
Module: memtest
Burn-in identifier:  MEMORY0, MEMORY1, et al.

This is a vicious little memory system thrasher that exercises
not only hardware but also the Linux paging system.
Only the best memory can survive this.  Unfortunately,
since it is a user-space application, it can't touch
all memory, but CTCS will attempt to configure it
appropriately for you.

2. Kernel compilation loop
Module: kernel
Burn-in identifier:  FKCOMPILE, KCOMPILE

It's well known that the GNU C compiler can reveal problems
with memory, CPU, and cache RAM.  The kernel compilation loop
repeatedly compiles the Linux kernel, looking for GCC
signal 11 or other errors.
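
The idea of a compile loop can be illustrated in a few lines.  This is
not the code of CTCS's kernel module, only a sketch of the technique:
compile_loop is a hypothetical name, and the messages it prints are
made up for the example.

```shell
#!/bin/sh
# Run a build command up to $1 times, stopping at the first failure.
# The remaining arguments are the command to run on each pass.
compile_loop() {
    passes=$1
    shift
    i=1
    while [ "$i" -le "$passes" ]; do
        if ! "$@" > /dev/null 2>&1; then
            echo "pass $i failed"
            return 1
        fi
        i=$((i + 1))
    done
    echo "$passes passes completed"
}
```

On an already-configured tree, something like
"compile_loop 10 sh -c 'make -C /usr/src/linux clean && make -C /usr/src/linux'"
would rebuild from scratch on every pass.  On sound hardware every
pass should succeed, so a failure on a later pass of an unchanged tree
is the interesting signal.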

3. Disk (block device) read tests
Module: sblockrdtest
Burn-in identifier:  BBhdaN0, BBhdaN1, BBsdcN4, etc

CTCS's newburn will automatically launch 8 staggered
linear reads for each hard disk to induce rapid seeking.
If a sector can't be read, these tests will break.
BB stands for "badblocks", the program used to execute the
test.  hda, sda, etc. are block devices; N? indicates
the number in the group of 8 staggered reads.
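
When reading results it can help to split these identifiers back into
their parts.  A minimal sketch, following the naming scheme described
above (BB, then the device name, then N and the read number);
decode_bb_id is a hypothetical helper, and its output format is
invented for the example.

```shell
#!/bin/sh
# Split a block-read identifier such as BBhdaN0 into device and read number.
decode_bb_id() {
    id=$1
    case $id in
        BB?*N[0-9]*) ;;
        *) echo "not a BB identifier: $id" >&2; return 1 ;;
    esac
    rest=${id#BB}      # drop the BB prefix       (hdaN0)
    dev=${rest%N*}     # text before the last N   (hda)
    num=${rest##*N}    # text after the last N    (0)
    printf 'device=%s read=%s\n' "$dev" "$num"
}
```

For example, "decode_bb_id BBsdcN4" prints "device=sdc read=4".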

4. System Log monitors
Module: dmesg
Module: messages
Burn-in identifier:  DMESG, SYSLOG

The Linux kernel is capable of detecting a large number
of error conditions that are outside the realm of
user applications such as Cerberus.  CTCS will
launch log monitors to watch for oopses or other
miscellaneous errors while the rest of the system
is running.

5. DAC960 driver monitor
Module: dac960
Burn-in identifier:  DAC960C0, DAC960C1, etc.

The dac960 driver monitor will watch the current status file
for each DAC960 controller 
(/proc/rd/c[controller number]/current_status) and will
report any errors.  This test won't be run if you don't
have any controllers handled by the dac960 driver.

(This last test will seldom run, since modern machines thankfully no
longer use those particular Mylex SCSI RAID cards.)

Optional Tests

There are wrapper scripts in CTCS for the following optional test
components, which can be fetched separately.  (See the help on the
command line options for "newburn" and the documentation in CTCS's
README.tests.)

1. Linux Test Project (SGI)
http://ltp.sourceforge.net
This project is similar in structure to CTCS, but
they have a lot more programmers.  The module
API is mostly compatible with CTCS, but
CTCS's driver scripts are more intelligent.
They also have more modules, and they're more focused
on software testing than hardware.

2. BYTE Benchmarking Suite (Linux Port)
http://www.tux.org/~mayer/linux/bmark.html
This is a good port of the generic benchmark utility
for Linux.  You can use it for exercising basic
floating/integer/memory/IO operations.

3. UCSC SmartSuite
http://csl.cse.ucsc.edu/software/smart/
Monitor your hard drives for failures, and predict 
them before they happen.  Very nice, if your
drives support it.

Acknowledgements

Portions of these notes enumerating CTCS's standard and optional feature
sets were taken from the VA-CTCS FAQ in the project source repository
and modified for presentation here.