[conspire] Why you should always test your backups

Rick Moen rick at linuxmafia.com
Wed Aug 8 14:35:08 PDT 2018


Quoting Paul Zander (paulz at ieee.org):

> In the long run, the guy who wiped out the production is probably
> lucky to be away from a company with multiple attitude problems.

Concur.

The junior dev at the unidentified firm in the
https://www.reddit.com/r/cscareerquestions/comments/6ez8ag/accidentally_destroyed_production_database_on/
thread merely clumsily knocked over a dangerously defective edifice, and
the provided default access credentials he mistakenly used _certainly_
should not have conferred any kind of full read/write/delete-level
access to the production database or to anything else in production.
That's just crazy.

As Deirdre suggests, that ghastly blunder on their part was outranked by
the much worse and more-damning one of having no tested backups of a
critical production database.  And, who does that?  Even a trivial MySQL
setup, used in business, is periodically dumped to SQL files stored
separately -- and then tested by replaying the SQL dump file into a
scratch machine to make sure valid data results.  By the time the
(culpable) CTO got around to scapegoating and threatening the hapless
junior dev, and the (culpable) DBA was unable to restore, yeah, pretty
clearly a terrible company.
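
A minimal sketch of that dump-and-verify cycle (the host, database, and
table names here are hypothetical, and authentication details will vary
with the setup):

   #!/bin/sh
   # Dump the production database to a dated SQL file, then prove the
   # dump restores cleanly on a scratch machine.  A dump you have
   # never replayed is a hope, not a backup.
   set -e
   DUMP=/backup/mysql/proddb-$(date +%Y%m%d).sql
   mysqldump --single-transaction proddb > "$DUMP"

   # Replay into a throwaway database on the scratch host.
   mysql -h scratchbox -e 'DROP DATABASE IF EXISTS restore_test;
                           CREATE DATABASE restore_test;'
   mysql -h scratchbox restore_test < "$DUMP"

   # Cheap sanity check:  a known table should hold plausible data.
   echo 'SELECT COUNT(*) FROM customers;' | mysql -h scratchbox restore_test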


The next comment in thread order, about the contrasting incident at
Amazon Web Services, is instructive:

1. An internal dev accidentally ran a maintenance script (on Christmas
Eve!) that deleted state data for existing customer load balancing.  The
problem manifested as customer performance issues, calling attention to
the missing state data.

2. A couple of efforts to restore load-balancing state from snapshots
ensued, re-enabling the affected load balancers and verifying the
results over Christmas Eve and Christmas Day.

3. Crisis having been handled, Amazon reviewed how access to the
affected state data was handled, correctly spotted a hole in its change
management (CM) process and fixed it, and also improved the recovery
process to make it faster and more automated for any future mishap.
_And_ it wrote up a detailed incident report covering root cause and
sequence of events, a version of which it made public.

4. According to the Redditor, the poor shlub who ran the script wasn't
punished, because his poorly thought-out script run should not have been
physically possible without CM review and signoffs.

And _that_ is how to do it right.  As I said during my time running
system administration at Linuxcare, the trick is to make new and
different mistakes.  That requires learning from the known ones, and
taking measures to avert them.




> I was never in IT, but I used to ask about the IT department when
> interviewing.   The one time I didn't ask, I found myself at a company
> that treated IT as a cost to be minimized without concern for how much
> time and frustration that caused for everyone else.

Well, this may depress you, but _every_ company treats IT as a cost to
be minimised.  Some are merely worse than others at keeping that
attitude from having corrosive effects on IT.  At the really bad example
firms, the fact that IT isn't a 'revenue centre' but rather a 'cost
centre' (a basic and undeniable truth from managerial accounting) gets
hurled about to undermine the IT department and to excuse every costly
screwup by the Sales Department (etc.).

One interesting question one might _imagine_ asking about IT matters
during a job interview -- though unfortunately the interviewers probably
won't know the answer -- is how the company allocates IT costs
internally.  At most firms, IT is treated as straight overhead expense,
with no attempt to allocate it to the business units that need and use
its resources.  At a minority of them, sundry usage metrics are used to
charge IT costs and wages back to the managers of the groups that
generate the work.  I've never been at one of those firms, but can say
it would have been _enormously_ satisfying to reply to the CADD operator
who told me 'Hey, man, security is not my job, it's yours' -- right
after he'd incompetently infected thousands of MS-Word documents on a
NetWare file server with the Concept virus -- with 'You might change
your mind about that after $20,000 in cleanup costs takes away your
entire department's budget.'



So, about testing backups:  I learned a small lesson about this, myself
-- fortunately, not the hard way -- concerning linuxmafia.com server
backups.  The basic scheme for those is publicly documented, at
http://linuxmafia.com/faq/Admin/linuxmafia.com-backup.html .  (Although
I have copies of this page, of course, having it verifiably archived at 
web.archive.org qualifies under the Torvalds Plan:  'Only wimps use
tape backup: real men just upload their important stuff on ftp, and let
the rest of the world mirror it ;)' -- Linus Torvalds on lkml, 1996.)

Some time in 2013, though, I discovered a hole in my backup plan.  See
item #8 in the list of data directories to back up?

   /var/mail                    SMTP mail spool

Prior to 2012-3, that line (in my documented backup scheme) was:

   /var/spool/mail              SMTP mail spool

...but a cross-check of my backup sets revealed that the rsync script
had merely been creating (if memory serves) a dangling symlink called
'mail', and not catching any of the spool directory's contents.  How 
could this be?  /var/spool/mail had been a regular directory holding 
SMTP-received mbox files for each deliverable local user since time
immemorial.  It wasn't a symlink.

I went and checked.  At some point during Debian upgrades, the contents
of /var/spool/mail had been moved to /var/mail and replaced by a
symlink.  Some Debian twinkie had decided /var/mail was more
standards-compliant, and had violated the Principle of Least Surprise
by moving important system data files without informing the sysadmin.
(I don't believe I was told, anyway.  I think I'd have noticed.)
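
The rsync behaviour, for what it's worth, was exactly as documented:
in archive mode, -a implies -l, which recreates symlinks as symlinks
rather than following them to their targets, so the backup got a bare
pointer instead of the mail spool.  A sketch of the failure mode and
the fix (my actual script differs, and the backup host name is made
up):

   # -a implies -l (--links):  a symlinked spool directory arrives on
   # the backup host as a bare pointer, dangling if its target was
   # never copied over.
   rsync -a /var/spool/mail backuphost:/backup/

   # The fix:  back up the real directory the symlink points to...
   rsync -a /var/mail/ backuphost:/backup/mail/
   # ...or tell rsync to follow symlinks to their referents:
   rsync -aL /var/spool/mail/ backuphost:/backup/mail/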

So, one lesson is:  Spot-check your backups widely.  Another is:  Don't
trust that something that's been true since time immemorial is still so,
without occasionally checking.
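
The spot-checking needn't be laborious, either.  A sketch, assuming the
same sort of rsync-over-ssh arrangement (backup host name again made
up):

   # Dry-run comparison of live data against the backup copy:
   # -n transfers nothing, -i itemises every difference found.
   rsync -ani /var/mail/ backuphost:/backup/mail/

   # And look for stray symlinks where real directories are expected:
   ssh backuphost 'find /backup -maxdepth 2 -type l -ls'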





