[conspire] Cheese it, it's the (French) cops!

Rick Moen rick at linuxmafia.com
Wed Nov 28 18:02:37 PST 2018


Paul asked:

> BTW, is it necessary to have the French Police siren?

Context is that the scrounged PIII-based Rackspace pizza-box server
still running linuxmafia.com has a not-very-loud alert sound going 
off 24x7, which IMO is slightly annoying during indoor CABAL meetings,
but not greatly so.  The alert is a two-tone thing slightly
reminiscent of the police siren you encounter in Pink Panther movies
(thus the name Duncan Mackinnon bestowed on this background noise).

I'm following up, however, to share the reason why I find this audible
alarm a really funny example of how to screw up the human-factors aspect
of computer hardware design, and also to warn that similar screw-ups abound.

So, to review, what the 'French Police' alert seems to be about:  This
Rackspace box is the ratty old server enclosure that from the early
2000s until a decade ago ran lists.svlug.org.  It lived inside Joe
McGuckin's small colo in Palo Alto, Via.net, until Via.net pulled the
plug on hosting SVLUG for free, at which point the unused hardware got
dumped in my garage.  I didn't do a thing with the machine until the day 
my final spare VA Linux 2230 motherboard died.  Just on a hunch, I
detached my SCSI hard drives from the dead motherboard, connected them 
to the Rackspace one, flipped the power on, and my server came fully
back online.  Great, do another full backup, then leave the machine the
hell alone pending migration to something modern.  (Which hasn't
happened yet.)  And, the more-than-faint-but-less-than-loud alert tone
has been present all the time it's been powered on. 

What's it alerting about?  Well, the Rackspace came with dual PIII CPUs,
but one cannot help noticing that one of those CPUs has a frozen,
no-longer-spinning CPU fan on top of it.  This is in general terms a
Very Bad Thing, and I consider it highly likely that the PIII under the
frozen fan died the death shortly after the fan froze up.  (This is one 
reason why I vastly prefer passive cooling over anything mechanical,
particularly fans.)  I'd be very, very surprised if that audible alarm
is saying anything but 'Hey, server owner.  You might want to know that
one of your two CPUs is utterly borked.'

But here's what's so flippin' hilarious:  Rackspace pizza-box servers
are intended for colos, not meditation chambers.  I mean, for Ghu's
sake, the name of the company is _Rackspace_.  And the reason why the
machine ran for untold years at Via.net while screaming its little heart
out to the best abilities of a tinny little speaker is that 
_nobody could possibly hear it in a noisy colo_.

Like, good grief, whoever decided to use a somewhat faint audible alarm
for major hardware failure in a _colo_ just didn't bother to think at
all.  That's like overhead billboards in Braille.  It's unclear on the
concept of 'colo'.


But, as mentioned, this sort of hilarious failure to think through human
factors is something you find widely in computer hardware.  Anyone
remember when, in 2006, I tracked down two bad ECC DIMM sticks in my
_prior_ VA Linux 2230 motherboard, using iterative kernel compiles with
'make -j NNN' set high enough?

http://linuxmafia.com/pipermail/conspire/2006-December/002662.html
http://linuxmafia.com/pipermail/conspire/2006-December/002668.html
http://linuxmafia.com/pipermail/conspire/2007-January/002743.html
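
(For anyone wanting to replicate the brute-force technique, it looks
roughly like the sketch below:  keep the compiler churning through
memory via repeated, highly parallel kernel builds until it trips over
the bad sticks, at which point gcc tends to die with 'signal 11' or an
internal compiler error.  Kernel-tree path, job count, and pass count
below are placeholders, not necessarily what I used back then.)

    # Brute-force RAM stress test:  repeated parallel kernel builds.
    # Random compiler crashes ('signal 11', internal compiler errors)
    # on a machine that otherwise seems fine are a classic symptom of
    # bad RAM.
    cd /usr/src/linux
    for pass in $(seq 1 20); do
        make clean >/dev/null
        make -j 32 bzImage 2>&1 | tee /tmp/kbuild-$pass.log
        grep -Ei 'signal 11|internal compiler error' /tmp/kbuild-$pass.log \
            && echo "pass $pass:  compiler fell over -- suspect bad RAM"
    done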

As mentioned in the second message, _after_ finding the two bad DIMMs, 
it occurred to me to wonder why the system hadn't informed me about this 
highly detectable hardware problem.  I mean, it was ECC
(error-correcting code) memory in an ECC-enabled server motherboard
(Intel L440GX+ 'Lancewood').  Shouldn't it grab the admin's attention
to say 'Hey, there's severely defective RAM in this system.  Maybe you
should fix that before you corrupt all data passing through memory.'?

It should have.  It didn't.  Part of the issue was that the L440GX+
BIOS was not configured to enable the maximal amount of extended-memory
testing, a setting Reg Charney, the machine's prior owner, had doubtless
chosen, as pretty much anyone would, because POST-checking gigs of
memory adds an absurd amount to boot-up time and is seldom useful.
Enabling that
extra amount of checking _did_ cause BIOS messages to display at boot
time (only) suggesting, e.g., that there was a bad stick in socket J0,
but there were still two problems with that:

1.  The message displayed only at boot.  Part of the point of running 
a server is to not boot it more than once in a blue moon.

2.  The message displayed only at the console.  Part of the point of
running a server is to (usually) run it headless (no monitor).  

So, another overhead billboard in Braille, basically -- and this
on something really important, defective RAM.
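
(For what it's worth, a modern Linux kernel with EDAC support will at
least export ECC error counters through sysfs, so a cron job or
monitoring agent can watch for this sort of thing without anyone
rebooting or sitting at a console.  A minimal sketch, assuming an EDAC
driver is loaded for your memory controller; exact paths vary by kernel
and chipset:)

    # Corrected (ce_count) and uncorrected (ue_count) ECC error totals,
    # as exposed by the kernel's EDAC subsystem.  A nonzero and climbing
    # ce_count usually means a DIMM is going bad.
    grep -H . /sys/devices/system/edac/mc/mc*/ce_count \
              /sys/devices/system/edac/mc/mc*/ue_count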



My third example was something I encountered on the job around 2005,
while I was working at Linux firm California Digital Corporation in
Fremont.  This was in some senses a successor firm to VA Linux Systems,
having bought out all of VA Linux's remaining parts and stayed in the
hardware market VA Linux Systems abandoned to chase after proprietary
software.  (VA Linux Systems rather stupidly refused to provide its
branding or customer lists to California Digital, so those were just
dropped in the ashcan.)

California Digital was run by owners BJ Arun and his wife.  One day, a
VA Linux Systems customer visited us in great distress, seeking our
help, and bringing his collapsed Oracle server, a VA Linux Systems 2250,
which was a 2U rackmount machine with the same Lancewood motherboard 
but also a hot-swap SCA SCSI backplane so that hard drives could be
swapped in and out if any failed.  The backplane was served by an
(expensive) Mylex hardware-RAID SCSI host-bus adapter, a PCI card.

On this guy's machine, there were maybe six 36GB SCSI hard drives in the 
drive bays, configured in the Mylex RAID set as a RAID5 array.  Some of
these drives had failed, one after another, so that (as of that day)
the number of drives still living had dropped just below the minimum
needed to keep the RAID array going.  The storage volume had dropped offline,
Oracle RDBMS had choked, and _suddenly_ the customer realised he had a
huge problem.  (And yes, he had completely failed to have backups.)

Arun worked out with the customer that I would work on a
time-and-materials basis for a day to see if I could revive his Oracle
files.  By judicious fiddling, I was able after a couple of hours to
convince the most-recently-failed SCSI drive to go back online and be
activated in the Mylex configuration, which brought the data volume
(tenuously) back to life.  Then, crossing my fingers, I plugged a spare
36GB Seagate Cheetah (IIRC) SCA SCSI drive into the hotplug backplane, 
told the Mylex controller to add it to the RAID array, let it remirror,
and at the end decommissioned the recently-failed drive.  Last, I added
replacements for the other failed drives, did another remirror, and
yanked the other failed drives.  Total time was, I believe, six hours of
billable time plus three spare hard drives.  Sadly, the customer was 
_not_ thrilled -- which he should have been -- but rather bitched
operatically about how expensive it was for me to have just saved his
bacon.  Some people just are never happy.

But the point is:  The guy had suffered _progressive_ failure of three 
(IIRC) mission-critical server hard drives on a vital Linux server with
an expensive RAID controller, chewing gradually through the hot spares
and then the live storage until catastrophic failure happened, but had
had absolutely no idea this syndrome was occurring and worsening.  Why?

I think the Mylex was one of the ExtremeRAID models with this BIOS-based
firmware:  http://www.aselebyar.nu/imgupload/doc_1177508286.pdf
Docs tell you all about how to configure a RAID array, etc.  
Let's say you're very, very wary, and are plowing through the docs to
find an answer to the question 'How am I, the admin, going to be
informed that my hot spares are being used up, and that, hey buddy,
maybe you should start replacing failed hard drives?'  You look, you
look and you look some more.  Eventually on page 2-16 you find this
oh-by-the-way:

   View/Modify the Device Health Monitoring (S.M.A.R.T.) Setting:

   Default = Disabled

   If Enabled, the Device Health Monitoring (S.M.A.R.T.) feature
   will scan physical devices for Information Exception Conditions
   (IEC) and Predictive Failure Analysis (PFA) warnings and return
   the warnings to the user.  S.M.A.R.T. stands for Self-Monitoring
   Analysis and Reporting Technology.

Oh, terrific.  First of all, it's _disabled by default_.  But second, 
the vague handwave about 'return the warnings to the user' indicates the
further and arguably worse problem.  What does that mean?  Hoist a flag
and play Yankee Doodle Dandy?  Send a stiff letter to the _Times_?
Oh, dunno, maybe lob an e-mail or SMS to the admin?

Surprise: none of the above.

All the Mylex docs _really_ mean the card will do, when it receives
'Hey, this drive is sick and about to die' information from a hard
drive's electronics, is log the information into the Mylex card's 
non-volatile memory in standard S.M.A.R.T. format.  How is this
vital and time-sensitive warning data 'returned to the user'?  The Mylex
attitude, quite typically, was 'Not my problem, Jack.  Sounds like
something the admin should be taking care of.'

Which is to say, on your choice of operating system including Linux, you
can _have_ software running that keeps an eye on S.M.A.R.T. data and
sends you a telegram or whatever if there's a problem.  Forgot to do
that?  Oops, you lose.  Gosh, shame that all that redundancy and
checking data kept getting logged but nobody was looking at it.

Just another billboard painstakingly crafted in Braille.  Something 
computer hardware designers seem to do on any random day ending in 'y'.

(Oh, BTW:  https://en.wikipedia.org/wiki/Smartmontools )
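
(A minimal sketch of the smartmontools side of that, assuming a drive
at /dev/sda and working mail delivery on the box -- one such line per
monitored drive in /etc/smartd.conf, with the device name, test
schedule, and mail address all being placeholders:

    # /etc/smartd.conf:  monitor all SMART attributes on /dev/sda, turn
    # on automatic offline testing and attribute autosave, run a short
    # self-test daily at 02:00 and a long one Saturdays at 03:00, and
    # mail any warnings to the named address.
    /dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root@localhost

After restarting smartd, 'smartctl -H /dev/sda' gives a quick manual
health verdict, for the impatient.)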




