[conspire] Old hardware, ridiculously old hardware: free RAM for you

Rick Moen rick at linuxmafia.com
Wed Jan 18 22:24:20 PST 2017


Dana was kind enough to come back over last night to help me work on a
Supermicro 2U server he'd given me.  Some months ago, Daniel had tried
to install onto it, and reported a problem; something about no video
display.  (I failed to take exact notes, intending to just circle back
and check it out myself.  January rolled around; it was time to
investigate.)

This is a really nice machine that was new circa 2010.  It's very quiet
for a 2U server, well built, and Dana says it draws only about 40W at
idle, which is not bad at all.  And with ample RAM and disk, you can 
do... a great deal.


o  2U case with quiet yet effective fans
o  Supermicro X8SIE motherboard based on Intel 3420 chipset ('Ibex Peak')
o  Intel Xeon X3430 @ 2.40GHz 'Lynnfield' quad-core CPU
o  Motherboard has 6 x SATA 3Gb/s headers, AHCI interface
o  Hotswap backplane that can hold up to 8 SATA drives
o  3Ware SATA hardware RAID controller
o  2 x Intel 82574L gigabit ethernet
o  Capacity 32GB DDR3-1333 ECC registered (or 16GB unregistered) dual-channel
   (interleaved) SDRAM, in six RAM sockets
o  PCI-E 2.0 x16 slot + PCI-E x8 slot + PCI 32-bit slot
o  Matrox G200eW w/16MB RAM video
o  6 x USB, plus two more as headers
o  2 x PS/2
o  2 x RS232C serial
o  Separate LAN interface for IPMI 2.0 (Realtek RTL8201N) [2]



After Saturday's CABAL meeting, I'd kicked the machine around and run
it through memtest86.  Overnight, the machine hard-froze in that
RAM-checker, so hard that the keyboard's NumLock key didn't even
toggle the LED.  Hard-booting caused it to enter a state where the fans
were stuck full-blast and there was no video.  I pessimistically thought
'Likely a wonky motherboard; that's probably why it was pulled.'[1]

But wait; not so fast.  After some more alternately leaving it unplugged
and poking it, the system came back up, and I had the excellent idea of
looking around in the BIOS.  Something in the back of my mind was
probably trying to tell me 'Look in the event log!'  Many motherboards
since the 1990s have had built-in hardware event logging, with
significant detail about system hardware problems but _no_ effort to get
the admin's attention:  You have to remember to go into the BIOS and
look.  Sure enough, I found a large number of single-bit RAM errors, at
least roughly corresponding to the times of system hangs, and always
citing the same RAM stick as where the problem happened.
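
On a running Linux system, the kernel's EDAC subsystem can expose
similar corrected/uncorrected ECC error counters under sysfs, so you
need not reboot into the BIOS to look.  What follows is only a sketch:
it assumes your kernel has an EDAC driver loaded for your memory
controller (not a given on every board), and the function name
edac_totals is mine, not anything standard.

```shell
#!/bin/sh
# Sum ECC error counters across all memory controllers that the Linux
# EDAC subsystem exposes.  The optional argument overrides the sysfs
# root, which is handy for trying the function against a scratch tree.
edac_totals() {
    root="${1:-/sys/devices/system/edac/mc}"
    ce=0    # corrected errors (e.g. single-bit, fixed by ECC)
    ue=0    # uncorrected errors
    for mc in "$root"/mc*; do
        # Skip the unexpanded glob and anything without counters.
        [ -r "$mc/ce_count" ] || continue
        ce=$((ce + $(cat "$mc/ce_count")))
        if [ -r "$mc/ue_count" ]; then
            ue=$((ue + $(cat "$mc/ue_count")))
        fi
    done
    echo "corrected=$ce uncorrected=$ue"
}
```

Run 'edac_totals' with no argument now and then (or from cron); a
corrected count that keeps climbing is the same warning the BIOS event
log was quietly recording.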


This machine had a pair of these 4GB Registered ECC sticks:
http://www.ubbcentral.com/store/item/Lot-of-two-(2)-Actica-ACT4GHR72R8G1333M-4GB-RAM-dimms---8GB-upgrade_152277298609.html

Well, shoot:  One of them is dodgy.  So far, this has held up as the
likely root cause of the freezing:  Since I replaced both sticks with
different RAM, the system has been quite stable.


Longtime CABAL people may find the above story hauntingly familiar,
because the same thing happened to me in December 2006, when a pair of
bad PC100 512MB ECC SDRAM sticks on a VA Linux Systems 2230 motherboard
caused mysterious problems:

http://linuxmafia.com/pipermail/conspire/2006-December/002668.html
http://linuxmafia.com/pipermail/conspire/2007-January/002743.html

Even though, just like the Supermicro's 4GB stick, these were _ECC_
(error checking and correcting) server-grade RAM, these bad sticks
_still_ caused instability and the system gave zero indication to the
admin.  So, no, ECC doesn't automatically protect you.

I like to say, the best memory-checker is actually Linux.  In the linked
posts, I illustrated how to -really- test RAM -- iterative parallel
kernel compiles configured to use up all RAM (the 'parallel' part),
using 'make -j':

# cd /usr/src/linux-source-2.6.16
# while : ; do make clean && make -j N ; done

...where you adjust N upwards until you just barely start to see swap 
activity in the output of the 'free' command, or as shown by vmstat.
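
One way to watch for that threshold from a second terminal is
'vmstat 5' (look at the si/so columns), or a tiny helper that reads
/proc/meminfo.  A minimal sketch, with a function name (swap_used_kb)
that I made up for illustration:

```shell
#!/bin/sh
# Print swap currently in use, in kB, computed as SwapTotal - SwapFree.
# Reads /proc/meminfo by default; pass another meminfo-format file to
# try it out without touching the live system.
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free  = $2 }
         END           { print total - free }' "${1:-/proc/meminfo}"
}
```

Something like 'while sleep 5; do swap_used_kb; done' then gives a
running figure; once it starts creeping above its idle baseline while
the compile loop runs, N is about high enough.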

Back then in 2006, with my 2 x 256MB sticks in the motherboard (system
total 512MB) and the dodgy 2 x 512MB sticks removed, I gradually
increased N from 4 to 256 before RAM was being fully exercised.  With
_only_ good RAM, I could keep that while loop running indefinitely.  If
I put either of the bad RAM sticks in, I'd get freezes or spontaneous
reboots within a few hours (with N high enough to exercise all RAM).

It's important to note that _memtest86 didn't find this problem_.
I'd run it at least 24 hours just before, and no errors had shown.

So: ECC is not a cure-all.
    memtest86 won't always find bad RAM.
    iterative, parallel kernel compiles _do_ always find bad RAM.



I must say, having a well-designed 2010 rackmount server at my disposal 
(now with all of my spare SATA drives in it) has reminded me once again
that tempus fugit, and that a lot of the old hardware sitting around in
my cabinets is way past its sell-by date.  As I said here the other day,
PATA (old IDE), for example, was never very good in the first place, and
has quietly left the retail market -- gone.  And, y'know, that's good,
because SATA (and its SCSI cousin SAS) is worlds better.

While Dana was here and we were taking care of other tasks, I sat down
and researched all of Dana's old spare RAM, and labelled it.  'Eh?', you
say.

Right, old RAM:  Over time, you accumulate old RAM sticks that you leave
on a shelf, preferably in antistatic bags -- but you never make use of
it because it's real work figuring out which of your old RAM sticks
would work in what machines.  

To avert that outcome, you have to put sticky paper labels on each set
of identical RAM and write what they are, what they're good for.  I did
this for all of Dana's spare RAM.  Then, I did likewise for mine.

Except for a spare pair of sticks for the Supermicro, and some still
inside spare rackmount servers I have, mine's pretty damned old -- see
below -- and will probably get junked soon.  _However_, if you want any
of this, speak up, and it's yours.



Laptop SDRAM (200-pin SO-DIMMs):
PC2-5300 DDR2:  Vintage 2005 or so.   2 x 512MB Hynix brand, 2 x 512MB Micron.
PC-2100 DDR:  Vintage 2002 or so.  1 x 512MB Micron brand.

Workstation/server SDRAM (DIMMs):
PC133:  Vintage 1999 or so.  8 x 256 MB Corsair brand, 168-pin
PC3200:  Vintage 2001 or so.  DDR.  1 x 256 MB Samsung brand, 184-pin
PC2-4200:  Vintage 2005 or so.  DDR2.  1 x 512 MB Samsung brand, 240-pin



If you want these, come get them!  Limited-time offer!

'DDR' in this context means Double Data-Rate (relative to
first-generation SDRAM like PC-100 and PC-133 sticks).  The generations,
DDR (circa 2001-2002), DDR2, and DDR3, are mutually incompatible (and
sometimes even need different voltage from prior generations), and
accordingly have different DIMM notch positions and pin densities so you
cannot accidentally use the wrong type.  Without getting into detail
(see Wikipedia), each generation is simply faster.
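
If you're labelling sticks, it helps to know how the 'PC' numbers
relate to speed.  For the DDR family, the number is roughly peak
bandwidth in MB/s: transfer rate in MT/s times 8 bytes (a 64-bit-wide
module), rounded for the label.  (PC-100 and PC-133 are different:
those numbers are just the clock in MHz.)  A sketch of the arithmetic,
using a helper name of my own invention:

```shell
#!/bin/sh
# Peak bandwidth, in MB/s, of a 64-bit DDR-family module:
# transfer rate (MT/s) x 8 bytes per transfer.
ddr_peak_mb_s() {
    echo $(( $1 * 8 ))
}

# Examples matching the sticks listed above:
#   ddr_peak_mb_s 266    prints 2128   (DDR-266,   labelled PC-2100)
#   ddr_peak_mb_s 400    prints 3200   (DDR-400,   labelled PC3200)
#   ddr_peak_mb_s 533    prints 4264   (DDR2-533,  labelled PC2-4200)
#   ddr_peak_mb_s 667    prints 5336   (DDR2-667,  labelled PC2-5300)
```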

Almost all _current_ SDRAM is DDR3: regular DIMM sticks for everything
but laptops, smaller SO-DIMM sticks for laptops & similar tiny machines.
Motherboards using DDR4 (such as those using Intel Haswell-E CPUs) have
also been entering the market.

SDRAM took over in the late 1990s from EDO DRAM, which in turn replaced
plain ol' DRAM (dynamic random access memory), properly termed FPM =
fast page mode DRAM, in the middle '90s (though few ever called it that).



Quantity of RAM is usually the biggest limiting factor towards perceived
machine performance, in my experience, so it's wise to stuff in as much
as possible, and use the highest-RAM-density sticks you can.  If
perchance your old machines can use the above sticks, _you want them_, 
if only as spares (because sometimes your sticks will develop faults).


[1] IT businesses, in characteristic fits of optimism, tend to pull
computers out of service whenever those machines show signs of
unreliable operation, intending to figure out their problems, but then
reality intrudes:  If reloading the OS doesn't make the problem vanish,
typically little or no further diagnosis is even attempted, as either
they don't know how or cannot spare the time.  This is probably what
happened with the Supermicro 2U server:  Nobody thought to check it for
bad RAM.  Similarly the free RAM I was trying to use in 2006 had
probably been yanked as suspect but never tested.

[2] Which has its problems, so I'm glad I can elect to have it not 
active at all, especially given that Supermicro makes it accessible
only with its proprietary Java application IPMIView.  Which in turn
led to  http://www.kb.cert.org/vuls/id/648646 .  Note the warning
that IPMI should be carefully traffic-restricted.


-- 
Cheers,                 "The crows seemed to be calling his name, thought Caw."
Rick Moen                                     -- Deep Thoughts by Jack Handey 
rick at linuxmafia.com 
McQ! (4x80)        



