[conspire] One way to test system RAM

Rick Moen rick at linuxmafia.com
Thu Dec 28 15:34:56 PST 2006


Quoting Eric De Mund (ead-conspire at ixian.com):

> Point of clarification: May I ask why you're using a home-grown method
> rather than the exceptionally thorough Memtest86 <http://memtest86.com/>?
> In past years I've had good luck with Memtest86, both for clients and
> personally. Note that in some cases on slower machines it has taken 20+
> hours to run to completion.

Well, normally, I'd have done exactly that, before anything else.
In this particular case, the conclusively bad RAM _had_ passed (if
memory serves) at least 15 hours of memtest86, a number of months
earlier -- probably a lot longer.  So, on the rebound I felt just a
little funny about relying principally on a tool that had -- or so it
seemed -- let me down for the first time on that very same hardware.
So, I reverted to massively parallelised kernel builds, the longtime
traditional RAM-testing method that was always _damned_ effective long
before memtest86 existed at all.  More about why and how, below.


For the record:  The original sign of trouble with the four sticks of
scrounged RAM (2 x 512 MB, 2 x 256 MB) was very, very occasional
spontaneous rebooting.  I'd observed this on rare occasions following 
work on the machine:  I'd walk back to it (or ssh in), and suddenly I
had short uptime again.  I continued to feel vaguely uneasy about this,
but there was never any sign of kernel panics, NMIs, or other specific
indications of trouble.  However, I _did_ do at least one run of
memtest86 off a Knoppix CD (can't remember exactly how many hours -- I
_believe_ it was 24+ hours, really), and came up clean.  (I _still_ felt
uneasy, but didn't have anywhere to go with that, at the time.)

These were _ECC_ sticks, mind you.  I had reason to believe that any
problems would at least get registered _somewhere_ I'd notice.

Something still seemed just a little hinky, but I wasn't clear on
exactly what.  Experience told me to suspect the RAM first, but I'd
_checked_ it!

In hindsight, there's something else, easy to do, that I should have
checked right about then:  If you have reason to suspect RAM, but
for whatever reason can't get a consistent, reproducible symptom, try
shuffling around the position of the sticks in their various sockets.
Also, if possible, try individual sticks one at a time (i.e., remove the
others from the machine for testing purposes).  Sometimes, the problem
will manifest clearly with the sticks in some configurations but not
others, and apparently I'd accidentally stumbled onto one of _those_
configurations where the RAM wasn't reliable but still tested clean.

Also, remember that you must consider other not-known-good parts as
suspects.  E.g., at a later point in my testing, when I'd seen fairly
compelling evidence of both 512 MB sticks having problems in J0 with no
other RAM present, I had to consider the possibility that socket J0
itself on the motherboard was intermittent or bad.

Anyhow, what we observed late Saturday night, near the end of the CABAL
meeting when I tried to resume the long-delayed migration, was the
dreaded "NMI: Dazed and confused but struggling to continue" kernel
message -- which most of the time means a bad stick of RAM, and I was
suddenly reminded of those unexplained spontaneous reboots, and the
distinctly non-reliable origin of some if not all of the four sticks
present.

CABAL attendees present of course discussed the problem with me for
about an hour from 11 PM to midnight, as I tried to review the recent
(prior year) history of the machine, the principles of hardware
diagnosis, an inventory of what parts were known-good (sadly, none), the
disadvantages of having no known-good parts on the shelf for diagnostic
purposes, the difference between a Heisenbug and a Bohr Bug, the
unlikelihood of coincidental failure of two parts without a good
independent reason to believe that (or a common cause for both failing),
the reason why the biggest risk for diagnostic situations is wasting
time getting inconclusive (or misleading) results, and why particular
suggestions people were making (all in the form of _questions_,
something that unfortunately raises diagnosticians' stress levels
unnecessarily and interrupts their concentration -- speaking for myself,
at least) weren't useful under present circumstances.

An old boss of mine in around 1990 at Blyth Software, MIS manager David
Carroll, had a characteristic saying he invariably used in diagnostic
situations:  He'd ask, of nobody in particular:  "What do we know?"  For
a while, I mostly classed it as a personal quirk, but eventually I came
to understand that it was an extremely helpful tool for shortening
diagnostic time:  You think:  What is the known history of this problem,
what are the variables, what assumptions am I currently making, what are
the suspects, what are the known-good elements?  You repeat that
discipline at intervals, to make sure you're not haring off on wild
goose-chases or wasting time doing something doomed to prove ultimately
inconclusive.

Karsten Self had a reasonable suggestion, albeit _also_ posed in the
form of a question ;-> -- of running parallel kernel compiles overnight,
especially since other easy forms of diagnosis hadn't yet borne fruit.  
And then, at that point, it was midnight, and people needed to go home.

I set up "while : ; do make clean && make -j 4 ; done" to implement
Karsten's suggestion overnight, and went to bed -- while pondering and
sleeping on the question "What do we know?".
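
(In hindsight, a slightly smarter loop would have appended a timestamp
after each completed pass, so that a later freeze would pin down
roughly when the machine died.  A minimal sketch -- the log path is
just an example, not what I actually typed:

   # Same loop, plus a timestamp per completed pass; after a freeze,
   # the last entry tells you approximately when the machine locked up.
   while : ; do make clean && make -j 4 ; date >> /var/tmp/burnin.log ; done

More on why knowing the time of death matters, just below.)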

Sunday morning, the compile was still on-screen, but the machine was
frozen.  Aha!  Double aha!

Aha #1 was:  Here we have pretty strong evidence of either a memory, or
CPU, or motherboard defect, in roughly that order of likelihood and with
RAM way out in front.

Aha #2 was:  The fact that the ongoing compile was still visible
onscreen meant that the freeze had taken less than about _5 minutes_ to
happen the prior night, because the console driver's screen-blanker
hadn't had time to kick in.

That's something to remember:  If you're doing hardware diagnosis, you
really should probably disable the screen-blanker.  I keep forgetting
the command option for that, and for years kept having to re-find even
the command _name_ on the Keyboard and Console HOWTO:  It's "setterm".
These days, I only have to re-read the setterm manpage (and I just now
needed to do that again).  The magic incantation is: 

   setterm -blank 0
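
If your console hardware also does power-saving blanking, and your
setterm build supports these options, the power-saving modes can be
switched off too -- a belt-and-suspenders variant of the same
incantation:

   # Disables the software blanker, APM power-saving, and auto-powerdown.
   setterm -blank 0 -powersave off -powerdown 0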

If you _really_ want to stress-test a machine, just install and run
Cerberus Test Control System (see:  "Cerberus FAQ" on
http://linuxmafia.com/kb/Hardware ), which was the testing suite
developed to "burn in" both new and repaired machines at VA Linux
Systems:  It runs both parallelised kernel compiles _and_ memtest86
_and_ a bunch of other hardware-stressing processes, all at the same
time.  It also automatically disables the screen-blanker, and puts right
on the console a time counter showing how long burn-in has been running 
(and allows you to quickly confirm visually that the system isn't
frozen).

Anyhow, at this point, I really _did_ have a smoking gun pointing at (at
least one stick of) RAM, but didn't know which RAM.

I tried one stick at a time in the low-numbered socket, J0.  With one of
the 512 MB sticks, I sometimes got an odd POST (power-on self-test)
message saying that extended RAM had failed a quality check.  Then, on
subsequent reboots, not.  With the other 512 MB stick only, not.  Then
not again.  Then a similar sort of POST message.  Then not.  Then not.
Neither of them was freezing during kernel compiles.  Then one was.
Then neither was.

No such errors with either of the 256 MB sticks.  

As before, hinky, but not smoking-gun.  (I was not yet at that time
taking care to ensure that _all_ of the RAM was being exercised during
kernel compiles.  This was somewhat stupid on my part, but my excuse is
that I'm unused to having enough RAM for that to be a concern.)
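
(One easy way to check that point, by the way:  While the compile loop
runs, watch memory usage from a second terminal or ssh session, and
confirm that essentially all physical RAM is in use -- buffers/cache
count, since the kernel soaks up spare RAM for them.  A minimal sketch:

   # Alongside the compile loop; refreshes 'free' output every minute.
   watch -n 60 free -m

If hundreds of MB sit permanently free, the test isn't exercising all
of the RAM.)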

A word about POST error messages:  Ordinarily, they're at best a blunt
instrument, and motherboard-based RAM-checking is _usually_ nearly
useless.  In fact, it's pretty much a bad joke:  You're in
false-negative country almost all the time, which is enough to give
one a bad attitude towards such things, generally.

I kept thinking about the situation, though, and remembered that this
was and is a respected _server_ motherboard (the Intel L440GX+
"Lancewood"), and it was and is _ECC_ RAM.  So, in the "What do we
know?" department, I reconsidered my then-current (and longtime)
assumption that BIOS-based RAM-checking is just about always laughably
useless and pointless.

I went into the L440GX+ BIOS Setup, and enabled the maximal amount of
extended-memory testing.  Someone else with similar prejudices had
probably cut that back to the bare minimum at some point in the past, in
the name of shorter boot time.  (The machine was Reg Charney's before it
was mine.  He gave it to me, shortly before he moved back to Toronto.)

Now, I was _consistently_ getting BIOS messages claiming that there was
an extended memory quality problem.  The BIOS, moreover, suggested that
there was a bad stick in socket J0.  It did this regardless of which
of the two 512 MB sticks was in J0 as the sole piece of memory -- but
not with either of the 256 MB sticks in there.

This was the point at which I had to consider the possibility of a
defective J0 socket.  (Remember, the watchword is:  "What do we know?")
It was possible that the socket's leads were intermittent, or that the
socket's decoding logic on the motherboard was generating errors on 512
MB-density sticks but not smaller ones.  I tried one 256 MB stick in J0
and a 512 MB stick in J1: The POST error now fingered J1.  Tried the
other 512 MB stick in J1: POST still fingered J1.

Hmm.  Those Intel engineers got it right.  Damn them.  ;->

At this point, I sighed, threw both 512 MB sticks away, started
parallelised continuous kernel compiles on the server's remaining RAM
(1/3 of what I'd started with), and used Firefox on my iBook to order
two fresh sticks from www.satech.com ($171, including $13 for Der
Gubernator).

When that merchandise became available for will-call pickup, I drove
down to Santa Clara to get it -- and _that's_ why I immediately went
back to properly-crafted kernel compiles (i.e., taking care to exercise
_all_ of the 1.5 GB, and drive the machine mildly into swap), instead
of memtest86.  Those have now run long enough to satisfy me -- so I
might do memtest86 as well, on Ye Olde Belt-and-Suspenders Theory.
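
To give a concrete (and purely illustrative) sketch of what
"properly-crafted" means here:  several independent copies of the
kernel tree, each looping its own parallel build, so that the combined
working set covers all of RAM and pushes slightly into swap.  The tree
paths and the count of three are assumptions, not what I literally ran:

   # Three build trees looping in parallel; together, their working
   # sets should cover the full 1.5 GB and nudge the machine into swap.
   for i in 1 2 3; do
      ( cd /usr/src/linux-$i && while : ; do make clean && make -j 4 ; done ) &
   done
   wait   # the loops run until interrupted (or until the machine freezes)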

But running memtest86 now would honestly be gilding the lily.  This
_is_ now good RAM.




