[conspire] Cheese it, it's the (French) cops!

Fri Dec 7 21:25:19 PST 2018

> Date: Wed, 28 Nov 2018 18:02:37 -0800
> From: Rick Moen <rick at linuxmafia.com>
> To: conspire at linuxmafia.com
> Subject: [conspire] Cheese it, it's the (French) cops!
>
> Paul asked:
>
>> BTW, is it necessary to have the French Police siren?
>
> Context is that the scrounged PIII-based Rackspace pizza-box server
> still running linuxmafia.com has a not-very-loud alert sound going
> off 24x7, which IMO is slightly annoying during indoor CABAL meetings,
> but not IMO greatly so.  The alert is a two-tone thing slightly
> reminiscent of the police siren you encounter in Pink Panther movies
> (thus the name Duncan Mackinnon bestowed on this background noise).
>
> I'm following up, however, to share the reason why I find this audible
> alarm a really funny example of how to screw up the human-factors aspect
> of computer hardware design, and also to warn that similar screw-ups abound.
>
> So, to review, what the 'French Police' alert seems to be about:  This
> Rackspace box is the ratty old server enclosure that from the early
> 2000s until a decade ago ran lists.svlug.org.  It lived inside Joe
> McGuckin's small colo in Palo Alto, Via.net, until Via.net pulled the
> plug on hosting SVLUG for free, at which point the unused hardware got
> dumped in my garage.  I didn't do a thing with the machine until the day
> my final spare VA Linux 2230 motherboard died.  Just on a hunch, I
> detached my SCSI hard drives from the dead motherboard, connected them
> to the Rackspace one, flipped the power on, and my server came fully
> back online.  Great, do another full backup, then leave the machine the
> hell alone pending migration to something modern.  (Which hasn't
> happened yet.)  And, the more-than-faint-but-less-than-loud alert tone
> has been present all the time it's been powered on.
>
> What's it alerting about?  Well, the Rackspace came with dual PIII CPUs,
> but one cannot help noticing that one of those CPUs has a frozen,
> no-longer-spinning CPU fan on top of it.  This is in general terms a
> Very Bad Thing, and I consider it highly likely that the PIII under the
> frozen fan died the death shortly after the fan froze up.  (This is one
> reason why I vastly prefer passive cooling over anything mechanical,
> particularly fans.)  I'd be very, very surprised if that audible alarm
> is saying anything but 'Hey, server owner.  You might want to know that
> one of your two CPUs is utterly borked.'

And, it had occurred to me earlier, but I hadn't peeked until just a wee
bit ago, ... I actually have login access to that host, so ...

$ ssh -ax linuxmafia.com. 'exec cat /proc/cpuinfo'
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 10
cpu MHz         : 1000.089
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 mmx fxsr sse
bogomips        : 2000.17
clflush size    : 32
cache_alignment : 32
address sizes   : 36 bits physical, 32 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 10
cpu MHz         : 1000.089
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 mmx fxsr sse
bogomips        : 2000.22
clflush size    : 32
cache_alignment : 32
address sizes   : 36 bits physical, 32 bits virtual
power management:

$

So, still shows both CPUs there, ... so ... maybe not (totally?) borked?
It might be alarming about the failed fan, ... or something else.  Who
knows what.  Could probably even run some command(s) to see if
processes are running on both CPUs.  I don't recall if that CPU
model, etc. has multiple cores or not ... I don't think so, but I don't
recall for sure (and yes, that can of course be looked up).  There's also
pretty good documentation around about how to interpret the contents of
the /proc/cpuinfo file.
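
For what it's worth, here is a minimal Python sketch of the "see if both
CPUs are doing work" idea.  It only parses text in the /proc/cpuinfo and
/proc/stat formats; the function names are mine (hypothetical), and it
says nothing about why the alarm is sounding.  A kernel may list a CPU
that is present but unhealthy, so treat the result as a hint, not a verdict.

```python
def parse_cpuinfo(text):
    """Split /proc/cpuinfo text into one dict per logical processor."""
    cpus = []
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            # Lines without a colon (e.g. a mail-wrapped flags line) are skipped.
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            cpus.append(fields)
    return cpus


def per_cpu_busy_ticks(stat_text):
    """Return {cpu_name: non-idle ticks since boot} from /proc/stat text."""
    busy = {}
    for line in stat_text.splitlines():
        parts = line.split()
        # Per-CPU lines look like "cpu0 user nice system idle iowait ...";
        # the aggregate "cpu" line is skipped.
        if parts and parts[0].startswith("cpu") and parts[0] != "cpu":
            # Keep at most the first eight counters (user .. steal); the
            # later guest counters are already included in user time.
            ticks = [int(n) for n in parts[1:9]]
            idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
            busy[parts[0]] = sum(ticks) - idle
    return busy
```

On the host itself one would feed these the contents of /proc/cpuinfo and
/proc/stat (e.g. via open(...).read()).  If cpu1's busy count keeps growing
between two samples, the scheduler is evidently still running work on it.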

Use of
# dmidecode
might also be useful - it may display and decode enough hardware information
to possibly indicate if there are detected hardware error(s) present.

> But here's what's so flippin' hilarious:  Rackspace pizza-box servers
> are intended for colos, not meditation chambers.  I mean, for Ghu's
> sake, the name of the company is _Rackspace_.  And the reason why the
> machine ran for untold years at Via.net while screaming its little heart
> out to the best abilities of a tinny little speaker is that
> _nobody could possibly hear it in a noisy colo_.
>
> Like, good grief, whoever decided to use a somewhat faint audible alarm
> for major hardware failure in a _colo_ just didn't bother to think at
> all.  That's like overhead billboards in Braille.  It's unclear on the
> concept of 'colo'.
>
>
> But, as mentioned, this sort of hilarious failure to think through human
> factors is something you find widely in computer hardware.  Anyone
> remember when, in 2006, I tracked down, using iterative kernel compiles
> with 'make -j NNN' set high enough, two bad ECC DIMM sticks in my _prior_
> VA Linux 2230 motherboard?
>
> http://linuxmafia.com/pipermail/conspire/2006-December/002662.html
> http://linuxmafia.com/pipermail/conspire/2006-December/002668.html
> http://linuxmafia.com/pipermail/conspire/2007-January/002743.html
>
> As mentioned in the second message, _after_ finding the two bad DIMMs,
> it occurred to me to wonder why the system hadn't informed me about this
> highly detectable hardware problem.  I mean, it was ECC (error-correcting
> code) memory in an ECC-enabled server motherboard (Intel
> L440GX+ 'Lancewood').  Shouldn't it grab the admin's attention to say
> 'Hey, there's severely defective RAM in this system.  Maybe you should
> fix that before you corrupt all data passing through memory.'?
>
> It should have.  It didn't.  Part of the issue was that the L440GX+
> BIOS was not configured to enable the maximal amount of extended-memory
> testing, which Reg Charney, the machine's prior owner, had doubtless set,
> as pretty much anyone would, because POST-checking gigs of memory adds
> an absurd amount to boot-up time and seldom is useful.  Enabling that
> extra amount of checking _did_ cause BIOS messages to display at boot
> time (only) suggesting, e.g., that there was a bad stick in socket J0,
> but there were still two problems with that:
>
> 1.  The message displayed only at boot.  Part of the point of running
> a server is to not boot it more than once in a blue moon.
>
> 2.  The message displayed only at the console.  Part of the point of
> running a server is to (usually) run it headless (no monitor).
>
> So, another overhead billboard in Braille, basically -- and this
> on something really important, defective RAM.
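
On reasonably current Linux kernels, the EDAC subsystem can expose ECC
error counters at runtime via sysfs, which gets around both the
"only at boot" and "only at the console" problems.  Below is a minimal
sketch that reads those counters.  Assumptions: an EDAC driver exists and
is loaded for the chipset in question (I have not checked that for the
L440GX+), and the function names are mine, purely illustrative.

```python
from pathlib import Path

# Standard EDAC sysfs location; each memory controller appears as mcN.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs files."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            # Missing or unreadable counter files: skip this controller.
            continue
        counts[mc.name] = (ce, ue)
    return counts


def memory_alerts(counts):
    """Return one message per controller that has logged any ECC error."""
    alerts = []
    for name, (ce, ue) in sorted(counts.items()):
        if ce or ue:
            alerts.append(f"{name}: {ce} corrected, {ue} uncorrected ECC errors")
    return alerts
```

Run periodically (cron or similar) and mailed to the admin whenever
memory_alerts(read_edac_counts()) is non-empty, something like this would
turn the Braille billboard into one a headless-server admin can read.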
>
>
>
> My third example was something I encountered on the job around 2005,
> while I was working at Linux firm California Digital Corporation in
> Fremont.  This was in some senses a successor firm to VA Linux Systems,
> having bought out all of VA Linux's remaining parts and remaining in the
> hardware market VA Linux Systems abandoned to chase after proprietary
> software.  (VA Linux Systems rather stupidly refused to provide its
> branding or customer lists to California Digital, and those were just
> dropped in the ashcan.)
>
> California Digital was run by owners BJ Arun and his wife.  One day, a
> VA Linux Systems customer visited us in great distress, seeking our
> help, and bringing his collapsed Oracle server, a VA Linux Systems 2250,
> which was a 2U rackmount machine with the same Lancewood motherboard
> but also a hot-swap SCA SCSI backplane so that hard drives could be
> swapped in and out if any failed.  The backplane was served by an
> (expensive) Mylex hardware-RAID SCSI host-bus adapter, a PCI card.
>
> On this guy's machine, there were maybe six 36GB SCSI hard drives in the
> drive bays, configured in the Mylex RAID set as a RAID5 array.  Some of
> these drives had failed, one after another, so that there were (as of
> that day) just barely fewer than the minimum number of drives still
> living to support the RAID array.  The storage volume had dropped offline,
> Oracle RDBMS had choked, and _suddenly_ the customer realised he had a
> huge problem.  (And yes, he had completely failed to have backups.)
>
> Arun worked out with the customer that I would work on a
> time-and-materials basis for a day to see if I could revive his Oracle
> files.  By judicious fiddling, I was able after a couple of hours to
> convince the most-recently-failed SCSI drive to go back online and be
> activated in the Mylex configuration, which brought the data volume
> (tenuously) back to life.  Then, crossing my fingers, I plugged a spare
> 36GB Seagate Cheetah (IIRC) SCA SCSI drive into the hotplug backplane,
> told the Mylex controller to add it to the RAID array, let it remirror,
> and at the end decommissioned the recently-failed drive.  Last, I added
> replacements for the other failed drives, did another remirror, and
> yanked the other failed drives.  Total time was, I believe, six hours of
> billable time plus three spare hard drives.  Sadly, the customer was
> _not_ thrilled -- which he should have been -- but rather bitched
> operatically about how expensive it was for me to have just saved his
> bacon.  Some people just are never happy.
>
> But the point is:  The guy had suffered _progressive_ failure of three
> (IIRC) mission-critical server hard drives on a vital Linux server with
> an expensive RAID controller, chewing gradually through the hot spares
> and then the live storage until catastrophic failure happened, but had
> had absolutely no idea this syndrome was occurring and worsening.  Why?
>
> I think the Mylex was one of the ExtremeRAID models with this BIOS-based
> firmware:  http://www.aselebyar.nu/imgupload/doc_1177508286.pdf
> Docs tell you all about how to configure a RAID array, etc.
> Let's say you're very, very wary, and are plowing through the docs to
> find an answer to the question 'How am I, the admin, going to be
> informed that my hot spares are being used up, and that, hey buddy,
> maybe you should start replacing failed hard drives?'  You look, you
> look and you look some more.  Eventually on page 2-16 you find this
> oh-by-the-way:
>
>    View/Modify the Device Health Monitoring (S.M.A.R.T.) Setting:
>
>    Default = Disabled
>
>    If Enabled, the Device Health Monitoring (S.M.A.R.T.) feature
>    will scan physical devices for Information Exception Conditions
>    (IEC) and Predictive Failure Analysis (PFA) warnings and return
>    the warnings to the user.  S.M.A.R.T. stands for Self-Monitoring
>    Analysis and Reporting Technology.
>
> Oh, terrific.  First of all, it's _disabled by default_.  But second,
> the vague handwave about 'return the warnings to the user' indicates the
> further and arguably worse problem.  What does that mean?  Hoist a flag
> and play Yankee Doodle Dandy?  Send a stiff letter to the _Times_?
> Oh, dunno, maybe lob an e-mail or SMS to the admin?
>
> Surprise: none of the above.
>
> All the Mylex docs _really_ mean the card will do, when it receives
> 'Hey, this drive is sick and about to die' information from a hard
> drive's electronics, is log the information into the Mylex card's
> non-volatile memory in standard S.M.A.R.T. format.  How is this
> vital and time-sensitive warning data 'returned to the user'?  The Mylex
> attitude, quite typically, was 'Not my problem, Jack.  Sounds like
> something the admin should be taking care of.'
>
> Which is to say, on your choice of operating system including Linux, you
> can _have_ software running that keeps an eye on S.M.A.R.T. data and
> sends you a telegram or whatever if there's a problem.  Forgot to do
> that?  Oops, you lose.  Gosh, shame that all that redundancy and
> checking data kept getting logged but nobody was looking at it.
>
> Just another billboard painstakingly crafted in Braille.  Something
> computer hardware designers seem to do on any random day ending in 'y'.
>
> (Oh, BTW:  https://en.wikipedia.org/wiki/Smartmontools )
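
To make the "somebody has to actually look at it" point concrete, here is
a minimal Python sketch around smartctl from smartmontools.  It runs
'smartctl -H' on a device and decodes the exit status, which smartctl
documents as a bitmask.  The function names are mine (hypothetical), and
drives hidden behind a hardware RAID controller may need additional
smartctl device-type options that this sketch does not attempt.

```python
import subprocess

# Meaning of each bit in smartctl's exit status, per the smartctl(8) man page.
EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed",
    2: "a SMART command to the disk failed",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "attributes were at or below threshold in the past",
    6: "device error log contains errors",
    7: "self-test log contains errors",
}


def decode_exit_status(status):
    """Return the list of conditions encoded in a smartctl exit status."""
    return [msg for bit, msg in EXIT_BITS.items() if status & (1 << bit)]


def check_drive(device):
    """Run 'smartctl -H' on device; return the decoded problems (empty = OK)."""
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return decode_exit_status(result.returncode)
```

Calling check_drive on each drive from cron and mailing any non-empty
result would do the job.  In practice the smartd daemon shipped with
smartmontools already does this kind of polling and can mail the admin,
so this is only to show how little it takes.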