[conspire] Last Year's Supercomputer

Rick Moen rick at linuxmafia.com
Mon Jun 9 00:41:01 PDT 2003


Mark, I was still a little groggy when I wrote that post (and in a
hurry to get out the door).  Commenting on my own post:

> You know, they've been offering some good deals.  Bear in mind that it's
> difficult to even approach whole-system pricing when assembling a system
> from parts, though.  I like to do the latter because I can control parts 
> selection and quality, but am aware there's a substantial premium.

What I mean is:  If you were quoting Fry's pricing for a pile of
individual components, please be aware that you can typically do a
_whole_ lot better on pricing if you buy one of their assembled systems.
Sometimes, the latter work out better even if you end up having to
extract, discard, and replace a few crappy components.

> Well, the good news is that Linux, like Unixes generally, will tell you
> very clearly that you have a RAM-defect problem through continual SIG 11
> errors and segfaults.  If you know what that means, then you use
> memtest86 to confirm your suspicions, and then you swap out the
> offending stick.  Because Linux has this (sort of) built-in RAM-defect
> alarm system, I don't consider it cost-effective for that OS.

By "it", I meant ECC:  ECC is mostly a crutch for poor bastards stuck on
NT, who need it to warn them that their data are turning to mush.
You're on Linux, so save your money:  Put the RAM through initial
burn-in on memtest86 for at least 24 hours, and you'll be fine.  If you
ever start getting SIG11s or frequent segfaults, that's Linux telling
you to wake up because you've developed a RAM (or CPU) problem.
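If you'd rather force the issue than wait for it to bite, a crude stress
loop does the job: repeat some memory-hungry task and watch for failures
at random points.  A minimal sh sketch -- the function name and the
kernel-build example are mine, not any standard tool:

```shell
#!/bin/sh
# stress_loop JOB RUNS: repeat a memory-hungry job, stopping at the
# first failure.  Failures at *random* points across runs suggest
# marginal RAM or an overheating CPU, not a software bug.
stress_loop() {
    job=$1
    runs=$2
    n=0
    while [ "$n" -lt "$runs" ]; do
        n=$((n + 1))
        $job >/dev/null 2>&1
        status=$?
        if [ "$status" -ne 0 ]; then
            echo "run $n failed with status $status"
            # a child killed by SIGSEGV exits with 128 + 11 = 139
            if [ "$status" -eq 139 ]; then
                echo "that's a SIG11"
            fi
            return 1
        fi
    done
    echo "all $runs runs passed"
}

# e.g., in a kernel source tree:  stress_loop "make -j2" 10
stress_loop true 5    # -> all 5 runs passed
```

A compile job is a good choice because gcc exercises lots of RAM and
CPU at once, which is exactly why kernel builds kept tripping the bug
in the story below.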

Reminds me of a story:

When I assembled my K6/233 tower system (still running perfectly, never
any parts failures, very cool, PC Power & Cooling Turbocool -- _not_ a
coincidence!), I used a cheapo FIC PA-2007 motherboard and RAM/CPU bought
separately from SA Technology.  For the K6, I also got a big-kahunga
heat sink with a fan on top that uses ball bearings instead of the
standard, noisy, failure-prone sleeve bearings.  As RAM, I got SDRAM
_rated for CAS2 operation_ at 100 MHz, which was the way you got
extra-good performance at that time.  (CAS = column address strobe.
I mean that memory access could be set to require only two CAS cycles,
instead of the standard three.)
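For the curious, the back-of-envelope arithmetic behind that claim:

```shell
# At 100 MHz, one clock cycle lasts 1000/100 = 10 ns, so CAS 2 shaves
# one full 10 ns cycle off every column access compared with CAS 3.
bus_mhz=100
ns_per_cycle=$(( 1000 / bus_mhz ))
echo "$ns_per_cycle ns saved per access (CAS 3 -> CAS 2)"
```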

Anyhow, I banged everything together.  It _seemed_ to work fine.  The
big tower case, low-heat-output CPU, and major cooling capacity meant
that the thing ran very cool and reasonably quiet even with the two very
fast IBM 10,000 RPM SCSI hard drives.  

As was my custom, I started compiling a kernel after a bit.  The
compile errored out with a SIG11.  Hmm.  Tried again.  Errored out even
faster, and at a different point in the compile.  Odd.  Do I have bad
RAM?

Shut down and re-seated the RAM.  Went downstairs to the CoffeeNet for
coffee.  Came back, fired up, ran the compile.  No problems.  Ran the
compile again:  SIG11.  Ran it again:  SIG11 at a different place.  Ran
it with only half the RAM:  Same symptoms.  Ran it with the other half:
Same.  Didn't seem to be a RAM problem(?).

Slept on the problem.  Woke up, fired up the machine, ran a compile:  No
problem.  Ran it again:  No problem.  Ran it a third time:  SIG11.  Ran
it again:  Same error, different spot.

Pondered the problem for a bit:  It seemed as if the error was kicking
in only after the system reached heat equilibrium, but not during
the initial 30 minutes of operation when the system was still
stone-cold.  But that didn't make much sense:  I opened up the case and
re-verified that the system really _was_ an engineer's dream of
conservative design, and that even the 10,000 RPM drives were running
cool.

I drove down from San Francisco to the Palo Alto Fry's and bought both
the heat-conductive pads you can sandwich between CPUs and their heat
sinks _and_ the thermal paste you can use instead.  I was going to make
double-sure that I had good contact, there, before taking the CPU back
to SA Technology and looking like an idiot if it turned out to be
perfectly OK.

I took the heat sink off the CPU, cleaned both off, put paste on them
and pressed them together.  For some reason, perhaps on account of some
bell ringing in my unconscious mind, a few minutes later I pulled them
apart again to look at the two pieces more closely.

And swore.

The motherboard socket for the K6 has a square outline, and the pins
have a square pattern, with one corner of the CPU's pin-layout being
different so you won't destroy it by putting it in the wrong way.  When
you put the CPU into the socket, if you aren't looking too closely, you
think: square socket, square CPU with a keyed feature to keep you from
screwing up, square heatsink/fan assembly.  1, 2, 3 -- done.  Foolproof.
(Well, no.)

The _top_ surface of the CPU turned out to have a heatsink-contact
surface that extended only across maybe 70% of the lateral distance, and
the other 30% was sunk down lower.

  ---------------------
  |            |      |
  |            |      |
  |            |      |
  |            |      |
  |  contact   |      |
  |  surface   |      | 
  |            |      | 
  |            |      |
  ---------------------

I'd accidentally rotated the heatsink 180 degrees, so that _its_ contact
surface was staggered over to the right-hand-side:

  ---------------------
  |      |            |
  |      |            |
  |      |            |
  |      |            |
  |      |   contact  |
  |      |   surface  |
  |      |            |
  |      |            |
  ---------------------

Only a little strip in the middle was actually touching, so all the
other thermal paste was still protruding up into the air, un-squashed.  
Most of the CPU top surface was getting _zero_ help with cooling, and
instead was radiating out into a nice warm, insulating air pocket under
the heatsink.

If this had been an Athlon or a Coppermine P-III, the CPU probably would
have committed seppuku, but the K6 was completely undamaged and has been
happily cranking away ever since.

But that experience confirmed me in my prejudice that a cool CPU
(like a cool system generally) is much, much, much to be preferred over 
one that runs hot and needs heroic measures like huge amounts of forced
air flow to stave off disaster.  The hotter parts may be faster -- but
usually (in most machine roles) that extra speed doesn't even show up in
ways you especially care about.

You wrote:

>  21 Tekram DC-315U Ultra-SCSI/SCSI-2 controller PCI card
>     with internal and external connectors.

That's certainly inexpensive.  I'm a little unclear what you're using it
for, since you didn't include any SCSI components -- nor any hard drives
of any sort, actually.  

The spec sheet at tekram.com claims it's based on a Tekram S1040
chipset -- which is unusual, since Tekram SCSI cards are best-known as
inexpensive, cost-effective implementations of good ol' LSI / Symbios / NCR 
chipsets such as the 53C810, 53C815, 53C875, 53C895, and such.  I also
notice it has _no BIOS_, which has two consequences:  (1) Most likely,
Linux will not be able to auto-probe the card.  Instead, you'll have to 
tell the booting kernel about it via command-line parameters in your
lilo or GRUB configuration.  (2) You won't be able to _boot_ from SCSI
devices, period.
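For what it's worth, here's the general shape of that workaround in
lilo.conf.  This is a hypothetical fragment: the "somedriver=..." string
is a placeholder, not a real parameter -- the actual name and syntax
depend on which driver ends up claiming the card, so check that driver's
documentation before trusting any of this:

```
# /etc/lilo.conf fragment (hypothetical; the append string is a
# placeholder, not a real driver parameter)
image=/boot/vmlinuz
        label=linux
        root=/dev/hda1
        append="somedriver=0x330,11"
```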

The "U" in the model number without a "W" to go with it means it's
so-called "ultra-SCSI" without a wide data path.  So, you have an 8-bit
= 1 byte data path (50-pin connectors), capable of 20 MB/sec transfers
across the SCSI bus.  That's not very high, these days.  Furthermore --
and, along with the 8-bit data path, more of a problem for long-term
usability -- the card doesn't support LVD (low-voltage differential)
operation, which puts severe limits on total chain length and
(consequently) on the number of devices that can reliably operate on the
chain.

Tekram does have other models that _can_ support LVD and wide (2 byte
wide) SCSI, such as the DC-390U2B, which uses the tried-and-true
LSI/Symbios 53C895 chipset.  It also supports up to 40 MHz transfers
across the SCSI bus, which, with the 2 byte wide data path, means it'll
do 80 MB/sec, maximum.
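The arithmetic is simple enough to sketch (the helper name is mine):
peak bus bandwidth is just the transfer rate in MHz -- millions of
transfers per second -- times the data-path width in bytes.

```shell
# scsi_peak MHZ WIDTH_BYTES -> peak transfer rate in MB/sec
scsi_peak() {
    echo $(( $1 * $2 ))
}
scsi_peak 20 1    # narrow Ultra SCSI (DC-315U):  20 MB/sec
scsi_peak 40 2    # wide Ultra2 (DC-390U2B):      80 MB/sec
```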

So:  I have no idea _why_ you're buying a SCSI card for a system with no
SCSI devices listed -- but I'd pay $70 more for something like the
DC-390U2B instead of sinking $21 into something like the DC-315U, whose
drawbacks and limitations over the long term would just frustrate me.

I'd get a long-term good SCSI card even if my only current use for it
were driving a flat-bed scanner, just so it would work well in other
roles, later.

If you're using ATA ("IDE") and have more than one hard drive, for
heaven's sake make sure you put them on different channels.  No matter
which type you use, _if_ you're buying new hard drives, consider ones
that spin at fairly high rotational speeds such as 10,000 or 15,000 RPM,
since that addresses one of the fundamental limiting factors on hard
drive performance.  Beware, of course, of the downside:  heat buildup.

I always _really_ like having two physical hard drives, as well, in
order to address the other fundamental limiting factor, namely seek time
-- and make a point of having a swap partition on each of them.
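To make that concrete, a hedged /etc/fstab sketch -- the device names
assume two ATA disks, one per channel (hda = primary master, hdc =
secondary master), so adjust to taste.  Giving both swap partitions the
same pri= value makes the kernel stripe swap across the two spindles:

```
# two swap partitions, one per physical disk, equal priority
/dev/hda2    none    swap    sw,pri=1    0  0
/dev/hdc2    none    swap    sw,pri=1    0  0
```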

I'm guessing that you're intending to carry forward your hard drive(s),
video card, keyboard, mouse, and monitor.  Fair enough.  I would, too.

-- 
Cheers,                    I've been suffering death by PowerPoint, recently.
Rick Moen                                                     -- Huw Davies
rick at linuxmafia.com  


