[conspire] Backup

Heather Stern star at starshine.org
Thu Feb 8 08:15:59 PST 2007

On Sun, Feb 04, 2007 at 09:36:56PM -0800, Paul Reiber wrote:
> Roger - I'll concur with Nick's comments and add a few of my own.

And I love how concise you are!  I'll add a few to the "threat model" list
(thanks, Rick, for mentioning it from that point of view).

/me picks her own brains for the many past reasons some client needed to
migrate or spin up a machine in a Big Hurry(tm)

> Potential points of failure, and whether they affect both drives in
> a "same box backup" solution:
> - fire, earthquake, flood - yes
  - temporary inability to get into building - maybe
> - power spike or brownout - yes
> - power supply or MB failure - probably
  - some klutz nailed the wrong power switch, and even though they turn it
    back on it's not ideal anymore - probably
> - drive controller failure - probably
> - vulnerability of being hacked - yes
> - vulnerability of accidental human error - yes
> - failure of one of the drives - no
  - vulnerability of deliberate human error/insane sysadmin - yes 
  - ISP decides they don't need you as a customer anymore - yes
  - distro upgrade turns out not to be "up" - maybe* 
     - automated backups - yes, and the backup system "up"graded too
     - deliberate, sysadminly backups before the change - save your tail!
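
For anyone who'd rather count than squint, here is a minimal sketch of the
list above as data.  It assumes Python; the names THREATS, tally and
survivable are made up for illustration, and the verdicts are copied straight
from the list (the two sub-items under the distro upgrade are left out).

```python
from collections import Counter

# (threat, does it affect both drives in a same-box backup?)
# Verdicts are taken verbatim from the list in this thread.
THREATS = [
    ("fire, earthquake, flood", "yes"),
    ("temporary inability to get into building", "maybe"),
    ("power spike or brownout", "yes"),
    ("power supply or MB failure", "probably"),
    ("wrong power switch hit", "probably"),
    ("drive controller failure", "probably"),
    ("being hacked", "yes"),
    ("accidental human error", "yes"),
    ("failure of one of the drives", "no"),
    ("deliberate human error/insane sysadmin", "yes"),
    ("ISP drops you as a customer", "yes"),
    ("distro upgrade turns out not to be up", "maybe"),
]


def tally(threats):
    """Count how many threats fall under each verdict."""
    return Counter(verdict for _, verdict in threats)


def survivable(threats):
    """List the threats a same-box second drive actually protects against."""
    return [name for name, verdict in threats if verdict == "no"]


if __name__ == "__main__":
    print(dict(tally(THREATS)))
    print(survivable(THREATS))
```

Only one entry comes back from survivable(), which is the whole point of the
list.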

Or to simplify it further into something that can be a grid, though it may
need 3D to be expressed well:
                          - basic physics    - human nature   - software
  - scale of problem
  - permanence of problem 
  - likely chance of problem
  - deliberate or accident

> You'll notice a boatload of "yes's" and "maybe's" - so
> your BS detector is accurate, and your head's definitely
> NOT in a FUD bucket.  I'm betting these guys don't have
> environmental sensors on their rack either...
The flip side of this is to ask: what good things does this kind of backup
offer?
  - fresh bits in a hurry after that typo toasted you (fast restore,
    possibly even without official downtime)
  - freedom about when to do it, not waiting on a tape changer
  - remote admins can restore things without local hands (if it's set up
    well and preserves that concept)
  - warm fuzzies to managerial types who want to see a Backup Plan without
    actually putting any decent money into it.  Funny how they change their
    mind how much money it was worth *after* their universe crumbles.

An ISP that's making a sincere effort to provide auto backups for its
customers, without indulging in massive-storage machines (which could
themselves become single points of failure), could slip extra standard hard
drives into all the customers' machines and have each backup go to someone
else's box, possibly in a rack halfway across the hall.  That would reduce
the risk of a popped circuit or an entire cage browning out taking the
backup down with it, while still keeping it on media that can keep up
(Nick's right, tape storage just can't) and nearby, since most of the
disasters worth yelping for a backup over still leave the building standing.

It still means indulging in extra hard disks.  There's still the risk that
human error on the other boxen scrambles someone else's backup, though the
backup shouldn't be mounted user-visible to that other customer anyway; that
narrows the risk to the ops team instead of any old doofus they let do
webmastery.  Maybe put it in an extra Xen guest?  And the entire facility
failing is still a bug.
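
The "backup goes to someone else's box" idea can be sketched as a small
placement routine.  This is only an illustration under stated assumptions:
the function name assign_backup_hosts, the box names and the cage labels are
all hypothetical, and the only rule it enforces is the one described above,
that a box's backup lands on a host in a different cage.

```python
def assign_backup_hosts(cage_of):
    """Map each box to a backup host that lives in a different cage.

    cage_of: dict of box name -> cage label (both hypothetical).
    Hosts are picked least-loaded first so no single box ends up
    holding everyone's backups.
    """
    load = {box: 0 for box in cage_of}
    plan = {}
    for box in sorted(cage_of):
        # Only hosts outside this box's own cage are acceptable.
        candidates = [h for h in cage_of if cage_of[h] != cage_of[box]]
        if not candidates:
            raise ValueError(f"no host outside cage {cage_of[box]!r} for {box}")
        host = min(candidates, key=lambda h: (load[h], h))
        plan[box] = host
        load[host] += 1
    return plan


if __name__ == "__main__":
    cages = {"alpha": "cage1", "bravo": "cage1",
             "carol": "cage2", "delta": "cage2"}
    for box, host in assign_backup_hosts(cages).items():
        print(f"{box} backs up to {host}")
```

It does nothing about the whole facility failing, which stays a bug either
way.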

In the worst backup story I've ever encountered, I was part of the crew
who cleaned up after the mess and rescued a lot of bits.  The backup
mechanism was the partial cause of failure - afaict it strained a drive
controller that was past its prime.  Human laziness in drive setup was
the cause of its failure being so drastic; the entire universe of things
they cared about lived on one giant RAID.  Which leaves the last note:
whilst RAID offers useful defense against single-HD failures, when the
whole array fails it's really nasty.  You get things like the mirror drive
refusing to mount either, because it thinks it should zero out now, or the
mirror having helpfully memorized the exact same state that's about to crash
again.  DBAs hate that...

There is some value in having some member of a company's top or middle brass
be the keeper of the small stack of DVDs that resembles an egg from which to
re-hatch their universe.   They are probably the people who will need to
address the sudden change in business plan that is needed if something
drastic really happens, anyway.  Them not being techie of the sort to do the
restore themselves is the least of their pain at that late date.

-* Heather
Documentation is like sex: when it is good, it is very, very good; and
when it is bad, it is better than nothing.  -- Dick Brandon
