[conspire] Slice of life

Carl Myers cmyers at cmyers.org
Fri Sep 18 11:05:30 PDT 2009


This resonates a lot with me.  Those of you who may have met me at BerkeleyLUG
might have some idea what company I am talking about, but let's just call them
"Argentina.com".  They had some *very* cool tools around these ideas, but at the
same time, lots of holes in their process too.

When you run a website as your primary business, it has a very different feel
from shrink-wrapped software.  There is no "big milestone" or "major release"; at
this particular company, updates were deployed to the production website every
single evening.  In such an environment, there was an "alpha" <some
developer's desktop>, a "beta" <integration environment>, a "gamma" or "master"
<exact replica of production, using production data, but not externally facing>,
and of course production.  They had an entire deployment system, which was
tightly coupled with the build system.  Everything (even perl and interpreted
languages) was "built" by the build system into "packages" which the deployment
system could then deploy to "environments".  Environments were versioned too, so
you could always say "oh shit, make this environment look like it did 10 minutes
ago before I broke the hell out of it".

The setup had some problems and lots of room for improvement, but it was a
brilliant system which served them well.  The main problem was that it gave devs
a false sense of security: they were far more likely to "roll it forward" than to
"roll it back" or, better yet, adequately test their change to begin with.

Ironically, the one system they did *NOT* manage this way was their DNS.  I can
recall at least one time when a 6-hour outage was caused by someone pushing an
untested breaking DNS change to production (there was no beta or gamma, only
alpha and prod, and the deployment system used for DNS was basically "scp").  It
turns out it can be really difficult to revert a change to DNS when that change
breaks your company's entire DNS, internally and externally.  I'm pretty sure
that person was not with the company much longer.

I suppose, long story short, on first inspection Rick's change notification
below may seem inane or counter-productive, but the reality is, such a process
would have saved a company I know of at least a million dollars on just one
incident, probably more long-term.

-Carl

On Thu, Sep 17, 2009 at 07:23:14PM -0700, Rick Moen wrote:
> Date: Thu, 17 Sep 2009 19:23:14 -0700
> From: Rick Moen <rick at linuxmafia.com>
> To: conspire at linuxmafia.com
> Organization: Dis-
> Subject: [conspire] Slice of life
> 
> At $FIRM, nothing may be changed in the production environment without
> approval of a formal change control proposal.  Part of what I do
> involves managing DNS and domain registration for the firm's (at last
> count) roughly 1840 Internet domains, not counting reverse DNS.
> 
> Here's a typical change control request for DNS that I banged out a few
> minutes ago.  (Yes, still using cvs in 2009.  Sad, I know.)
> "example.com" gets cited, here, in place of the real domain name.
> 
> The "named-conffile" test used is something I invented to compensate for
> BIND9's notorious tendency to die with no usable diagnostic information
> whenever there is _any_ syntax problem in its conffiles or any DNS zonefile.
> It's now considered mandatory, here, for all DNS changes, and locally termed
> "the Rick test".  (Having to figure out why BIND is dying on startup on
> the master nameserver, with no clues, under time pressure, is no fun
> whatsoever.)
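
[Inline note: a minimal sketch of the filter step of "the Rick test", for
anyone who wants to try the idea without a nameserver handy.  The function
names rick_filter and check_clean and the sample checker output are made up
for illustration, and the pattern list is only a subset of the one in the
change request further down.  The real test pipes named-checkconf output
through egrep; this just shows the "any match means stop" logic.]

```shell
# Print the distinct lines that look like trouble (subset of the patterns
# used in the change request; extend as needed).
rick_filter() {
    grep -E 'missing|not allowed|unknown|file not found|invalid|unsupported|no TTL|ignoring' | sort -u
}

# Succeed only if the filter prints nothing, i.e. "should return null".
check_clean() {
    out=$(rick_filter)
    if [ -n "$out" ]; then
        printf '%s\n' "$out"
        return 1
    fi
    return 0
}

# Made-up sample output standing in for a real checker run.
good_output='zone example.com/IN: loaded serial 2009091701'
bad_output='zone example.com/IN: loaded serial 2009091701
e/example.com-cage:42: unknown RR type'

printf '%s\n' "$good_output" | check_clean && echo "clean: safe to reload"
printf '%s\n' "$bad_output"  | check_clean || echo "suspicious lines found: do not reload"
```

Wrapping the filter in a function that returns non-zero on any match makes it
easy to gate the reload step on the result instead of eyeballing the output.
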
> 
> Point is:  Any operation that's serious about quality and process
> control does this, i.e., everything's in a VCS, every change has a
> change window and pre-scripted backout procedure, and the live
> public environment is never "hotfixed", but rather receives planned
> change pushes from the development environment, accompanied by testing
> them.  If anything goes wrong, either I or anyone else can run the
> backout procedure and immediately revert changes.
> 
> This is also yet another reason why shell scripting, sed, awk, find,
> grep, xargs, perl, etc. persist over crappy GUI toolsets:  They permit
> relatively easy writing in advance of precise, tested change routines with
> matching backout procedures.  I can thus edit 600 domains at once, as
> easily and precisely as I can two.
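
[Inline note: a rough sketch of "scripted change plus matching backout" applied
to many files at once.  Everything here is hypothetical: the throwaway
directory, the zone1..zone3 files, and the ns-old/ns-new names.  It keeps
.orig copies for the backout, whereas the change request below reverts from
cvs instead.]

```shell
# Build a scratch directory with a few fake zonefiles to edit.
workdir=$(mktemp -d)
for n in 1 2 3; do
    printf 'ns-old IN A 192.0.2.%s\n' "$n" > "$workdir/zone$n"
done

# The change: the same substitution in every file, saving the original first.
apply_change() {
    for f in "$workdir"/zone*; do
        case $f in *.orig) continue ;; esac   # never edit the saved copies
        cp "$f" "$f.orig"
        sed 's/ns-old/ns-new/' "$f.orig" > "$f"
    done
}

# The backout: put every saved original back in place.
back_out() {
    for f in "$workdir"/zone*.orig; do
        mv "$f" "${f%.orig}"
    done
}

apply_change
cat "$workdir"/zone1      # shows the edited record
back_out
cat "$workdir"/zone1      # shows the original record again
# Remove the scratch directory by hand when finished: rm -rf "$workdir"
```

The loop does not care whether the glob matches three files or six hundred,
which is the point being made above.
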
> 
> Notice that the backout procedure checks the previous zonefile revisions
> out of cvs and then bumps their serial numbers _twice_ each.  Anyone
> who's managed DNS is nodding, right now, because the iron rule of
> zonefile S/Ns is that they must always go only upwards.
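
[Inline note: a small worked example of why the backout bumps twice.  It
assumes serial.sh simply adds one to the serial, which the post does not
actually say, and the serial value is invented.  It also ignores serial
wraparound.  The bump and secondary_will_transfer functions are stand-ins, not
real tools.]

```shell
# Hypothetical stand-in for /site/bin/ops/serial.sh: a bump is just +1.
bump() {
    echo $(( $1 + 1 ))
}

# A secondary only pulls the zone if the primary offers a higher serial
# than the one it already holds.
# $1 = serial offered by the primary, $2 = serial held by the secondary
secondary_will_transfer() {
    [ "$1" -gt "$2" ]
}

old=2009091701            # serial in the revision being reverted to (invented)
live=$(bump "$old")       # serial that the change published
once=$(bump "$old")       # reverted file bumped once: same as $live
twice=$(bump "$once")     # reverted file bumped twice: higher than $live

secondary_will_transfer "$once" "$live" \
    || echo "one bump: $once is not newer than $live, secondaries ignore it"
secondary_will_transfer "$twice" "$live" \
    && echo "two bumps: $twice is newer than $live, secondaries transfer"
```

Checking the old revision out of cvs also brings back the old serial, so a
single bump only gets you level with what is already published.
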
> 
> 
> 
>  Subject: Change Control Request: Sunfire and Netapp DNS
>  Date: Thu, 17 Sep 2009 18:34:38 -0700
>  From: Rick Moen 
>  To: [operations mailing list], [colleague]
> 
> 
> *** Needs Assessment -
> 
> [Colleague] recently deployed a Sun Microsystems SunFire X4540 in [data
> centre] as a network storage device with hostname ki5-18.example.com =
> IP 10.112.2.165.  He's reported some problems, however, because that
> IP's reverse DNS (PTR entry) doesn't match, resolving to hostname
> ki33-29.example.com, instead.
> 
> Also, [colleague] needs new DNS entries for netapp121.example.com
> = IP 10.22.0.121 and netapp122.example.com = IP 10.22.0.122.
> 
> Reference:  RT-133005
> 
> 
> *** New Features -
> 
> Correct one rDNS entry, add two new forward entries.
> 
> 
> *** Assumptions -
> 
> None.
> 
> 
> *** Risks -
> 
> Low.  This is a known procedure.
> 
> 
> *** Process/Procedure -
> 
> #In my sandbox:
> cd ~/cvs/site/confs/named/master
> 
> Add to zonefile e/example.com-cage, in the netapp section:
> ; Added for RT-133005 - rmoen 20090917
> netapp121       IN  A   10.22.0.121
> netapp122       IN  A   10.22.0.122
> 
> sed -i 's/ki33-29/ki5-18/' reverse/2.112.10.in-addr.arpa
> 
> #Update S/N:
> /site/bin/ops/serial.sh e/example.com-cage
> /site/bin/ops/serial.sh reverse/2.112.10.in-addr.arpa
> 
> # Double-check changes:
> cvs diff e/example.com-cage
> cvs diff reverse/2.112.10.in-addr.arpa
> 
> cvs ci -m "Add 2 example.com entries, fix ki5-18 rDNS - RT-133005" \
> e/example.com-cage reverse/2.112.10.in-addr.arpa
> merge_patcher \
> e/example.com-cage reverse/2.112.10.in-addr.arpa
> sudo r2qa -u -q devel,staging,20090901 \
> e/example.com-cage reverse/2.112.10.in-addr.arpa
> sudo p2c --cluster=admin \
> e/example.com-cage reverse/2.112.10.in-addr.arpa
> 
> #On ii53-30:
> #Double-check BIND conffile:
> /usr/sbin/named-checkconf -z -t /var/named/chroot/ /etc/named.conf | \
> egrep 'missing|not allowed|unknown|not at top of zone|'\
> 'appears to be an address|no current owner name|MAXTTL|file not found|'\
> 'may not be used with|outside epoch|in future|invalid|unsupported|no TTL|'\
> 'ignoring| TTL set to prior TTL' | sort -u
> #Should return null.
> 
> # Reload zones:
> rndc reload
> 
> # Check /var/log/messages for errors.
> 
> 
> *** Change Window -
> 
> When convenient.
> 
> 
> *** Test Plan -
> 
> dig netapp121.example.com. @localhost +short
> dig netapp122.example.com. @localhost +short
> dig -t ptr 165.2.112.10.in-addr.arpa. @localhost +short
> # Should return, respectively, 10.22.0.121, 10.22.0.122, and
> # ki5-18.example.com.
> 
> 
> *** Back out Procedure -
> 
> cd ~/cvs/site/confs/named/master/
> 
> cvs up -r 1.341 -p e/example.com-cage > e/example.com-cage
> cvs up -r 1.3 -p reverse/2.112.10.in-addr.arpa >  reverse/2.112.10.in-addr.arpa
> #Update S/Ns:
> /site/bin/ops/serial.sh e/example.com-cage
> /site/bin/ops/serial.sh e/example.com-cage
> /site/bin/ops/serial.sh reverse/2.112.10.in-addr.arpa
> /site/bin/ops/serial.sh reverse/2.112.10.in-addr.arpa
> # Double-check changes:
> cvs diff e/example.com-cage
> cvs diff reverse/2.112.10.in-addr.arpa
> cvs ci -m "Reverting RT-133005 changes" \
> e/example.com-cage reverse/2.112.10.in-addr.arpa
> merge_patcher \
> e/example.com-cage reverse/2.112.10.in-addr.arpa
> sudo r2qa -u -q devel,staging,20090901 \
> e/example.com-cage reverse/2.112.10.in-addr.arpa
> sudo p2c --cluster=admin \
> e/example.com-cage reverse/2.112.10.in-addr.arpa
> 
> #On ii53-30:
> #Double-check BIND conffile:
> /usr/sbin/named-checkconf -z -t /var/named/chroot/ /etc/named.conf | \
> egrep 'missing|not allowed|unknown|not at top of zone|'\
> 'appears to be an address|no current owner name|MAXTTL|file not found|'\
> 'may not be used with|outside epoch|in future|invalid|unsupported|no TTL|'\
> 'ignoring| TTL set to prior TTL' | sort -u
> #Should return null.
> 
> # Reload zone:
> rndc reload
> 
> # Check /var/log/messages for errors.
> dig netapp121.example.com. @localhost +short
> dig netapp122.example.com. @localhost +short
> dig -t ptr 165.2.112.10.in-addr.arpa. @localhost +short
> # Should return, respectively, null, null, and
> # ki33-29.example.com.
> 
> *** Approval -
> 
> Pending:  Another Sr. SA, [Manager].
> 
> 
> _______________________________________________
> conspire mailing list
> conspire at linuxmafia.com
> http://linuxmafia.com/mailman/listinfo/conspire

-- 
Carl Myers 
PGP Key ID 3537595B
PGP Key fingerprint 9365 0FAF 721B 992A 0A20  1E0D C795 2955 3537 595B
