[conspire] fine if you aren't noticing symptoms: Re: (forw) [BALUG-Admin] Weekly cron job to check on my domains' nameservers

Rick Moen rick at linuxmafia.com
Sat Sep 9 13:35:46 PDT 2023


Quoting Michael Paoli (michael.paoli at cal.berkeley.edu):

> Reminds me of the too often encountered:
> Host is dead, or severe I/O problems on filesystem(s).
> "Of course" it's production (well, far too often).
> Start digging and checking, ah, lovely, all protected with RAID-1 great!
> Uhm ... except ... the first drive of the RAID-1 died N months/years
> ago, and no monitoring was put in place, or it's been entirely ignored.
> And now the only other drive in that RAID-1 pair has failed ...
> lovely ... ugh.  Oh well, at least we have backups ... oh ...
> nobody ever bothered because ... RAID-1 ... or that started failing
> N months/years ago - but nobody's monitoring or the monitoring has been
> ignored.

[snip many other fine examples of "Looks good, and we've heard nothing,
so we'll assume everything's fine."]
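
(The missing monitoring Michael describes can be as small as the sketch
below.  This is a hypothetical cron snippet, not anything from his setup
or mine, and it assumes Linux md software RAID plus a working local MTA;
hardware RAID would need the vendor's tool instead:)

  #!/bin/sh
  # Hypothetical /etc/cron.daily RAID check -- illustrative only.
  # A degraded md array shows an underscore in its status field, e.g. [U_].
  if grep -q '\[.*_.*\]' /proc/mdstat; then
      # Something has dropped out of an array: be noisy about it.
      mail -s "RAID DEGRADED on $(hostname)" root < /proc/mdstat
  fi
  # Silence when all is well proves nothing unless something else
  # confirms the monitor itself is still running.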

That really is a vicious syndrome, isn't it?  Potentially a
company-killer, too.  I'll cite two examples:  


The first client for my network consulting business, in the early 90s,
was a small SoMa clothing manufacturer named NE Wear (which I'm free to
name since they're not around anymore -- not my fault).  As I was
kicking around the office, fixing things, I suddenly had a Sherlockian
"dog that didn't bark in the night"[1] moment, while gazing at their backup
setup.  Bunch of tapes, DDS2 DAT drive, Netware server, admin
workstation, all good, about to review procedures, but... wait.  What?
Where...?

  Me to the office admin:  "Excuse me, where are the cleaning tapes?"
  Office admin (saying the words of doom):  "What's a cleaning tape?"

The _larger_ problem (as soon emerged) was absence of test restores, but 
the absence of cleaning tapes was horrifying, so remedying backup became
Job One.  I'll elide details.  For one thing, I've doubtless told the
story in this space before.

But the incident scared me, as the firm could have died of incompetence
_on my watch_: it was a near-certainty that exactly zero of their backup
sets were readable, and thus they were doing backup-as-ritual, that
being what a former boss of mine sarcastically called "almost
useful".  As a teaching device, I adopted that day the signature quirk
of referring to backup systems as "_restore_ systems", and would always
cheerfully explain that restoring was the part that mattered, and I
wanted to focus management's attention onto that.
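
(The "restore system" framing lends itself to automation, too.  Below is
a minimal sketch of a periodic test restore -- hypothetical paths and a
tar-based backup are assumed; it is not what NE Wear ended up with:)

  #!/bin/sh
  # Hypothetical test-restore check: prove the latest archive is readable
  # by actually extracting it, not by trusting the backup job's exit code.
  set -e
  latest=$(ls -t /backup/*.tar.gz 2>/dev/null | head -n 1)
  [ -n "$latest" ] || { echo "No backup archives found" >&2; exit 1; }
  scratch=$(mktemp -d)
  tar -xzf "$latest" -C "$scratch"
  # A fuller check would also compare a few restored files against
  # known-good copies before declaring victory.
  rm -rf "$scratch"
  echo "Test restore of $latest succeeded"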


In the 2000s, after I was laid off from VA Linux Systems, I got a
relatively decent survival job doing dual-role sysadmin/IT at small
Fremont Linux hardware firm California Digital Corporation ("CDC"), run
tight-fistedly by BJ Arun and his wife, which among other things had
bought out all remaining VA Linux Systems inventory when it suddenly
exited the hardware business during the Dot-Bomb.  I remember CDC with
fondness even though management was a little demented, but there was one
striking incident:

A panicked business owner sought our help with his VA Linux Systems
model 2250 dual-PIII 2U rackmount server (w/Intel "Lancewood" L440GX
motherboard), one of the top-of-the-line ones with SCA 80-pin backplane
for up to 8 hot-swappable enterprise SCSI drives and a Mylex hardware
RAID controller.  He was in a panic because this was his Oracle server,
and some of the eight Quantum Atlas V drives had progressively failed
until, that morning, the RAID5 array had gone offline because there were
now one-too-few drives in the RAID set.

I remember that we at VA sold a ton of those Mylex controllers to
higher-end customers.  I actually have one among my box of spare
PCI expansion boards, but have never used it.  Word in the community was
that these boards were the very best, most reliable things.

Arun tasked me to do a best-efforts attempt to resurrect the customer's
server.  He cleared this with the customer, in particular the hourly
rate that would be charged for my time, and I was thus to keep a careful
record of time spent and all materials used.

Over a period of (I think?) six or seven hours, I managed to coax the 
just-offlined drive back online, and then restriped onto several
replacement Atlas V drives, one at a time, getting the hardware back to
spec.  The customer got his Oracle server back -- and then complained
bitterly about the cost of 6-7 hours of technician time and several
replacement SCSI drives.  (Sheesh, what an ingrate.)

Arun tasked me with writing up a statement (for the customer) justifying
the billable hours I had racked up.  I thought this was irrational in
context, and was yet another example (among many) of management failing
to have the backs of us hires when we did exactly as they ordered.
(CDC's demise a few years later was rapid and self-induced, but that's
another story.  I had jumped ship before that happened.)

But the _point_ of this story:  All those years I'd been working on and
around people's VA Linux Systems SCSI-based servers, including the
ones with pricey Mylex controllers and hot-swap drive bays, and getting
deep into the guts of the firm's Red Hat Linux variant, RH-VALE (Red Hat
with VA Linux extensions)...  I'd just taken for granted that _some_
good, highly reliable mechanism _must_ be in place to make sure server
stakeholders get _prominently_ advised of RAID5-member drive failures.

Right?  Surely?

I honestly don't know the answer to that, having never actually
administered such (pricey, noisy, power-gobbling) servers, only worked
around/on them.  So, I don't know whether the distraught owner of an
erstwhile Oracle server had found some creative-chump way to disable
hardware status notifications from the Mylex controller, or if he got
them but ignored them, or if they were missing from the beginning.

But, I'll tell you, it reminded me of why that Sherlockian trick can be
so important, and I try to train myself to spot where a protection has
been just assumed to be there, or assumed to work, but nobody has
checked in ages or perhaps ever.

And, getting back to my weekly report on domains' health, that's why I 
keep thinking of ways to improve it to make it more terse when
everything's fine, and just talky enough upon detection of a problem.
The script is frankly not very good, especially from the point of view
of "human factors", i.e., presentation.  Its main virtue is that even a
perfunctory checking script is better than the nothing I had before,
which led to that "scruz.net" debacle of six secondary authoritative
nameservers silently flaking out, with nobody noticing.

My initial goal with the /etc/cron.weekly/mydomains cron job was _just_
to make sure that would never happen with my own domains, and it
achieves that goal -- clumsily, so far.
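
(For the curious, the shape of such a check can be quite small.  The
sketch below is _not_ the actual /etc/cron.weekly/mydomains script, just
a hypothetical illustration of the idea: query every nameserver listed
for each domain, stay terse when all of them answer, and get talky when
one doesn't.  The domain list is a placeholder, and it needs dig(1):)

  #!/bin/sh
  # Hypothetical weekly domain-health check -- not the real "mydomains".
  # Needs dig(1); the domain list below is a placeholder.
  DOMAINS="example.com example.net"
  problems=""
  for dom in $DOMAINS; do
      # Nameservers currently published in the zone's NS RRset, per the
      # local resolver (a fuller check might walk the parent delegation).
      nslist=$(dig +short NS "$dom")
      if [ -z "$nslist" ]; then
          problems=$(printf '%s\n%s' "$problems" "$dom: no NS records found")
          continue
      fi
      for ns in $nslist; do
          # Each listed server should answer the SOA query itself.
          if ! dig +noall +answer +time=5 +tries=1 @"$ns" "$dom" SOA \
                 2>/dev/null | grep -q 'SOA'; then
              problems=$(printf '%s\n%s' "$problems" \
                  "$dom: no SOA answer from $ns")
          fi
      done
  done
  # Cron mails whatever is printed, so keep the happy path to one line.
  if [ -n "$problems" ]; then
      printf 'Nameserver problems detected:%s\n' "$problems"
  else
      echo "All listed nameservers answered for: $DOMAINS"
  fi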


[1] Holmes's "dog that didn't bark" trope was so brilliant that everyone
remembers that bit but most people cannot name the story to save their
lives.  It was "Silver Blaze", the name of a famous racehorse that
disappeared one night.  In the morning, the vanishing got noticed, but
also the horse's trainer was found murdered just outside the stable.

Holmes was able to solve the case because the guard dog was known to
have not barked that night.  Therefore, the intruder had been someone it
knew and didn't consider an intruder.  

  Gregory (Scotland Yard detective): “Is there any other point to which
  you would wish to draw my attention?”

  Holmes: “To the curious incident of the dog in the night-time.”

  Gregory: “The dog did nothing in the night-time.”

  Holmes: “That was the curious incident.”

Holmes figured out that the trainer had conspired with a gang to steal
the horse, but that the horse had kicked him in the head and killed him
during the theft.
https://brieflywriting.com/2012/07/25/the-dog-that-didnt-bark-what-we-can-learn-from-sir-arthur-conan-doyle-about-using-the-absence-of-expected-facts/



