[conspire] looking for good FAQ Websites for CPU heatsink/Fan hardware

Fri Apr 4 12:49:13 PDT 2008

Quoting K Sandoval (indigo.kai at gmail.com):

> I looked at one CPU Fan/Heatsink while I was at Central Computers on
> Monday, and it was HUGE!  I had no idea CPU Fan/Heatsinks were that
> large.  We are talking the size of a Large Starbucks Travel Mug.

Yeah, crazy, isn't it?  One of the reasons I'm relatively content, for
now, with my gradual, inadvertent lapse into retro-computing (i.e., PIII
servers, PPC G3 iBooks) is that at least the heat output (and power
draw) was and is reasonable.  I've been hearing too many stories of
people having problems with heat buildup[1] with recent gear for it to
be entirely coincidental.

> I am not quiet sure I like this idea.  I saw Rick mention something
> about bearings and I was wondering, can someone direct me to a good
> website that I review and research CPU hardware?

I see you got references to some of the tweaker[2] sites.  You can also
just Web-search on fan manufacturers; sometimes, that helps.  I'm sorry
I'm not giving you more-specific advice on how exactly to distinguish
good from bad:  The reason is that I simply haven't needed to solve that
problem, lately (as to CPU fans).  

I _think_ that some CPU-fan literature/documentation specifically
mentions using something other than sleeve bearings (e.g., ball
bearings), but couldn't swear to that.  It's been so long, I can't
clearly recall.

> Kai: I had one of the engineers at VMware recommend PQI RAM for my
> eMachine, and I have had no problems that I can tell with the two 1GB
> RAM Sticks currently in my eMachine.  That is why I was looking at PQI
> RAM again.

Well, if you have reason to have confidence in the VMware guy, you could
choose to rely on that.  And it's not like cheap RAM can't be perfectly
fine.  Or, you could figure out the delta of street-pricing between PQI
and Mushkin / Crucial / Corsair / Geil / Micron, and decide whether the
extra money buys enough peace of mind to merit the extra expense.  On
that point:

> How do you know when you are having RAM Memory Issues?  What kind of
> symptoms suggest a RAM Memory problem?

Unfortunately, the symptoms of bad RAM are difficult to distinguish from
those of problems in related hardware:  defective CPU, intermittent /
dirty RAM-slot connectors, defective motherboard, general heat buildup,
components not fully seated in their slots / sockets.  If it's a gross,
i.e., severe RAM defect, then you might, once in a blue moon, actually
get some benefit from the alleged RAM checker routine that runs during
the motherboard's power-on self-test (POST), the one triggered by the
BIOS ROM-based boot-up sequence where you see RAM counting upwards from
zero to the full amount of RAM in the machine, at the early part of
startup.

I am biased against BIOS ROM memory-checking routines generally, because
in my experience they're close to 100% useless:  Almost all RAM defects
serious enough to cause system instability turn out, apparently, to be
too subtle to be detected by the BIOS ROM-based checker.

However, please be aware that _some_ BIOS ROM Setup programs have the
ability to set greater or lesser amounts of RAM-checking upon startup,
and the ability to enable or disable ECC checks.  So, with the right
options set, it's not unknown to actually _find_ bad RAM using the
built-in BIOS-based checker.  Just don't count on it.

Back to the symptoms:  Short of being told by the BIOS ROM-based
checker, upon startup, that one of your sticks might be bad, the usual
symptoms of bad RAM are as follows, in increasing order of severity:
segfaults (segmentation faults) of applications, kernel panics[3], or 
just plain spontaneous reboots with no explanation and no entry in the
logfiles suggesting why.
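
If the machine does leave traces, a quick search of the kernel log can
turn them up.  What follows is a minimal sketch, not a definitive recipe:
"scan_log" is a made-up helper name, the patterns are just the message
fragments discussed in this post plus the kernel's usual "segfault at"
wording, and which file to search (often /var/log/messages or
/var/log/kern.log) depends on your distribution.

```shell
#!/bin/sh
# scan_log (hypothetical helper): print, with line numbers, any lines in the
# given log file(s) that look like the symptoms described above.
# grep exits non-zero when nothing matches, and so does this function.
scan_log() {
    grep -E -i -n 'segfault|NMI received|kernel panic' "$@"
}

# Example (the log path is an assumption; adjust for your distribution):
#   scan_log /var/log/messages
```

An empty result proves nothing, of course:  spontaneous reboots in
particular often leave no log entry at all.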

A word about segfaults:  It's an error message that's _normally_
supposed to indicate a program attempting to read or write a
non-existent or invalid segment (physical chunk) of memory or one not
allocated to it.  However, if there's a bad piece of physical RAM, and
the program is trying to use it, for reasons not entirely clear to me,
when the read/write operation fails or returns garbage, that often
manifests as a segfault message reported to stderr (the standard error
output stream), something like this:

$  /usr/bin/[some program]: line 20: 11624 Segmentation fault [some program]
...and the program process dies.

But there's a gotcha here, that I wanted to call to your attention:  GUI
program launchers such as those of KDE / GNOME / xfce4 tend to
_swallow_ (lose/whatever) the stderr stream of applications they launch.
I.e., if messages like segfault errors crop up, you aren't shown them.  

Let's say, for example, that you are in the habit of running
OpenOffice.org Writer from the GNOME menus.  You're doing that on a
newly built machine with a new set of RAM sticks, and OO.o is
(naturally) the biggest, most bloated application you ever normally run.  
You notice, however, that it tends to mysteriously bomb out of memory
with no explanation at unpredictable intervals averaging about three
minutes -- bad!  

You seek help on a mailing list like this one, and someone advises you:
"Kai, try launching OO.o Writer from a terminal window, rather than from
GNOME's menus."  You open GNOME Terminal, or xterm, or whatever.  Let's
say you're not really sure how to launch OO.o Writer directly from a
terminal, so, you launch it from the GNOME menu, and then, quickly
before it blows up, switch to the terminal window and do

 $ ps auxw | grep "office"
kai      13593  0.0  0.1   1944   612 ?        S    11:58   0:00 /bin/sh /usr/lib/openoffice/program/soffice -splash-pipe=5
kai      13608  1.8  9.5 130892 36660 ?        Sl   11:58   0:03 /usr/lib/openoffice/program/soffice.bin -splash-pipe=5
kai      13644  0.0  0.2   3156   844 pts/2    R+   12:01   0:00 grep office
 $

So, you guess that you can do 
$ /bin/sh /usr/lib/openoffice/program/soffice -splash-pipe=5 &
...and you'd be right.  (The "&" means "start this thing as a forked-off
process so that I get my command prompt back".)

(It's an unfortunate example, because a better way to start OO.o Writer
from the command line is to just type "oowriter &".  However, never mind
that.  I'm just saying the messier incantation is probably what the
GNOME launcher will do.)

The point of all this:  Because you started OO.o Writer from the command
line, you haven't lost the stderr stream:  The process will still report
back errors to the launching terminal.  Which is what you want.  When
Writer segfaults a few minutes later, you get to _see_ the reason why.
Which is yet another reason why it's useful to know how to launch / kill
/ control processes from a terminal, and not just from the menus of a
desktop environment.
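
If you'd rather keep those messages than watch them scroll by, you can
redirect stderr to a file.  A minimal sketch, assuming a POSIX shell:
"run_logged" is a hypothetical helper name, and the log path in the usage
line is arbitrary.  It relies on the shell convention that a process
killed by a signal yields exit status 128 plus the signal number, so a
segfault (SIGSEGV, signal 11 on Linux) shows up as 139.

```shell
#!/bin/sh
# run_logged LOGFILE COMMAND [ARGS...]
# Hypothetical helper: run COMMAND in the foreground, append its stderr to
# LOGFILE, and add a note when the exit status is above 128, which by shell
# convention suggests the process was killed by a signal.
run_logged() {
    rl_log=$1
    shift
    rl_status=0
    "$@" 2>>"$rl_log" || rl_status=$?
    if [ "$rl_status" -gt 128 ]; then
        echo "run_logged: $1 killed by signal $((rl_status - 128))" >>"$rl_log"
    fi
    return "$rl_status"
}

# Example: run_logged "$HOME/oowriter.err" oowriter
```

The extra note is there because the "Segmentation fault" line itself is
printed by the shell waiting on the process, not by the dying program,
so a plain redirect of the program's stderr would not capture it.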

Anyway, suppose you observe a pattern of ongoing segfaults.  You think: 
Is it just this one application?  (Why _just_ OO.o Writer?)  One
difference is, OO.o Writer might just be the _biggest_ (most RAM-using)
application you normally run.  So, you think, either the application
itself is corrupted for some reason, or it (more than other apps)
happens to be hitting a bad stretch of RAM -- maybe because the bad RAM
in question has a high enough memory address that it seldom gets used
except when I'm either running a lot of things or one really big thing.
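
To check the "biggest application" hunch, you can compare resident
memory (the RSS column in the ps output shown earlier) across processes.
A rough sketch:  "biggest_rss" is a hypothetical helper that just sorts
"RSS command" lines read from stdin, and the ps invocation in the comment
assumes a ps that accepts the POSIX -e and -o options.

```shell
#!/bin/sh
# biggest_rss [N] (hypothetical helper): read lines of the form
# "RSS-in-KiB command" on stdin and print the N largest (default 5).
biggest_rss() {
    sort -k1,1nr | head -n "${1:-5}"
}

# Typical use, assuming ps accepts POSIX -e and -o:
#   ps -e -o rss= -o comm= | biggest_rss 5
```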

You think:  But the problem is that segfaults _could_ be caused by a
stick of RAM with a bad spot, or by a corrupted application file, or by
some other key hardware component with a hardware problem.  How do I
narrow down which one?

You try The GIMP, and edit some large graphics file.  If The GIMP also
segfaults after a few minutes, then you have greater confidence that
it's the size of app (total memory usage).  If not, maybe you try
booting a live CD distro and launching OO.o Writer from that.  (It's
unlikely that _both_ your installed OO.o Writer and a copy on a live CD
are corrupted.)

At some point, if it keeps looking like bad RAM, you think:  Is there a
way I can test directly if it's bad RAM or not?  You open your box (if
it's not a laptop) and count DIMMs.  If there's more than one, you could
try removing particular DIMMs and re-testing.  If the problem vanishes,
you suspect the particular DIMM you just removed.  You try putting it
back and seeing if the problem recurs.  (Don't forget to _also_ try 
moving the DIMMs around, in case the problem is not a memory stick but
rather a defective memory _socket_.)

(You should also just try re-seating all components, _before_ messing
around with removing RAM sticks.)

The above may be more than you wanted to know about problem diagnosis --
or not -- but the point is that it's the sort of diagnosis process you
_must_ go through if you have mysterious, recurring problems that might 
(or might not) be caused by a stick of RAM that either was defective
from the factory or has developed a bad spot, e.g., because of a power
spike.[4]

99 times out of 100, there's absolutely nothing wrong with cheap RAM of
dubious pedigree.  That 100th time, it puts you through some trouble (as
above).  Some would say paying a bit more for RAM that you have
confidence in (on account of brand and/or vendor) is cheap insurance.
Others disagree.

Quality in RAM (or hard drives, etc.) _isn't_ just a matter of brand
name -- which doesn't necessarily signify.  Don't forget what I said
about the sorting / grading process that results in some parts being
sold by the manufacturer as down-graded to lower speed ratings and/or as
"seconds", and some retailers being bottom-feeders who on average tend
to have junkier and less reliable gear for a variety of reasons
including buying sales lots of seconds / remaindered items / OEM "pulls"
and other slightly dubious components in order to save money on
inventory.  In some cases, vendors also never return to their
distributors parts that customers bring back as defective, instead just
re-shrinkwrapping them and putting them back out on the shelf.[5]

Thus, if you find a retail vendor you have confidence in, you might want
to trust to the retailer's quality practices.  For example, I've had
really good luck with S.A. Technologies of Santa Clara
(http://www.satech.com/), and often buy their "Excelerate" house-brand
RAM, because I have some confidence in them as a supplier.  Daniel
disagrees, asserting that they primarily sell _used_ RAM.  However, the
RAM *I* buy from them is guaranteed new, and in my experience has been
really good (and cheap).

[1] I had to laugh, a couple of years ago, when I figured out why all
the laptop manufacturers roughly simultaneously ceased using the phrase
"laptop computer" in their literature, with absolutely no explanation:
Apparently, a few users had been doing computing with their clamshell
PCs _literally_ on their laps, the little things developed heat
problems, and the user's family jewels got singed.  Legal hilarity
ensued, whereupon a bunch of manufacturers decided that it was better to
change all company promotional materials such that their lawyers couldn't
be contradicted when they said "We never said to put it _on_ your lap."

In case you hadn't noticed, they're all officially "notebook computers"
or "personal workstations", these days.  _Never_ "laptops".

[2] Person who fiddles with computer hardware to improve its operations --
as opposed to "tweaker" = "abuser of stimulant drugs".  ;->

[3] I'm referring here to messages like "Uhhuh.  NMI received. Dazed and
confused, but trying to continue" in one of the main logfiles (and
generally also on text console #1), usually but not always followed by a
kernel panic and the seize-up or spontaneous reboot of the machine.  I
believe that, these days, the "NMI" (non-maskable interrupt) message
snippet goes on to give you a helpful guess along the lines of "you
probably have a hardware problem with your RAM chips or a power saving
mode enabled".  

I get the impression that the NMI messages are becoming a bit more
specific and helpful:  Through Web searching, I see that there's a
slightly different variant that sometimes comes up:  "NMI received for
unknown reason [something] on CPU 0 [or whichever it is].  You have some
hardware problem, likely on the PCI bus."  That's a step forward over
the rather vague messages we used to get.

[4] And _that_ is another reason to not use cheap, no-name PSUs that
could be marginal for the load you're going to put on them:  They're
much more likely to pass along PG&E power spikes to attached equipment
and damage sensitive parts such as RAM sticks.  Other people tend, I
notice, to have problems with RAM going bad, in some machines, and
they're never ones with Cooler Master, Enermax, PC Power &
Cooling, or Sparkle/SPI PSUs.  I've never had a stick _go_ bad in one of
my machines -- though my VA Linux 2230 server once had two sticks that
were evidently already bad when I got them (for free).

[5] In fairness, at least some of these vendors probably operate on a
reasonable theory that most customers bringing back "defective" gear are
idiots who couldn't navigate through their own houses with a map and a
flashlight, let alone through computer hardware diagnosis.  They
probably figure that most returned gear is actually perfectly fine --
and that sending it back to the distributor is likely to lead to higher
prices or other problems upstream.  In an ideal world, they would be
able to check returned parts on the shop's own test-bench, but that's
seldom economical.