[conspire] The first rule

Rick Moen rick at linuxmafia.com
Tue Mar 1 04:08:28 PST 2016


Let me share with y'all the most important, first rule of technical
work:  If the task is 'X must work', then the task isn't complete until
you've demonstrated that X works.

Simple, right?  But I can't tell you how many times I've heard 'Well, I
_think_ it should work.'

Because I run a DNS nameserver, I do secondary ('slave') DNS for a bunch
of people's domains.  The theory is, you do secondary for their domains;
they do secondary for yours.  Doing secondary is dead-simple, setup and
forget:  You just make your secondary pull down data from the primary.
Done.  Dead-simple.

Being diligent, I also watch for signs of trouble, though.  Some days
ago, a major Linux guy's domain triggered an ongoing error in my
secondary nameserver, because his primary suddenly started offering a
zone serial number (S/N) _lower_ than the one my secondary already had.

It was trying to offer allegedly updated S/N '21150228' when my
nameserver already had '2013040202'.  Which, you will perceive, is a
higher number.

Now, this was some sort of hapless editing screwup.  The recommended
value for a zonefile S/N is based on the current day's date, and is
YYYYMMDDnn, where nn starts at 00 and ends at 99, for a total of 100
revisions you can make to a domain in 24 hours.  So, today, March 1st,
the first S/N value for a zonefile would be 2016030100, then 2016030101,
etc.
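
To make the YYYYMMDDnn convention concrete, here is a minimal Python
sketch that picks the next S/N for a given day.  The function name
next_serial is my own invention for illustration; it is not part of
BIND9 or any other tool.

```python
from datetime import date


def next_serial(current: int, today: date) -> int:
    """Return the next YYYYMMDDnn serial after `current` for `today`."""
    # First serial of the day, e.g. 2016030100 for March 1st, 2016.
    base = int(today.strftime("%Y%m%d")) * 100
    if current < base:
        return base          # first revision made today
    if current < base + 99:
        return current + 1   # a later revision on the same day
    # nn is exhausted, or the current serial is already dated in the future.
    raise ValueError("no higher YYYYMMDDnn serial available for this date")


print(next_serial(2013040202, date(2016, 3, 1)))
print(next_serial(2016030100, date(2016, 3, 1)))
```

The point of the date-based scheme is that a serial built this way can
only go up as long as you never type one dated in the future.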

The iron rule of S/Ns is that they always must ascend or you get serious
trouble, because your secondaries will think they already have a later
zone version than what the primary is trying to send to replace it:
A secondary pulls new data only when the primary's S/N is higher than
its own ('Ah, newer S/N; I seem to need to accept new data.').  If it
isn't higher, no revision.
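
The secondary's decision can be sketched in a few lines of Python.
This assumes the comparison follows the 32-bit serial number arithmetic
of RFC 1982, which is what the DNS standards call for; the function
name serial_is_newer is hypothetical, and a real nameserver does a good
deal more around this check.

```python
HALF = 2 ** 31   # half of the 32-bit serial space
MOD = 2 ** 32    # serials wrap around modulo 2**32


def serial_is_newer(offered: int, held: int) -> bool:
    """True if `offered` counts as newer than `held` (RFC 1982 style)."""
    diff = (offered - held) % MOD
    # Newer means "ahead by less than half the serial space".
    return 0 < diff < HALF


# The primary offers 21150228; the secondary already holds 2013040202.
print(serial_is_newer(21150228, 2013040202))
# A properly dated serial for March 1st, 2016 would have been accepted.
print(serial_is_newer(2016030100, 2013040202))
```

With the two values from this incident, the offered serial does not
count as newer, so the secondary keeps the old zone data.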

So, the domain owner made a hapless edit error.  Bad, but it happens.  I
sent him mail saying what was wrong and that he needed to fix it.  A few
hours later, he said it was fixed.

This is where the First Rule comes in.  My nameserver kept showing the
same problem (reported to me by logcheck).  Which meant, no, he didn't
fix it at all.

What did he fix?  He's a little unclear on this, but it seems likely he 
made some local edit, but _never queried DNS_.

Problem.

If the task is 'DNS must be serving up a correct S/N over the network',
then the task isn't complete until you've demonstrated that DNS is
serving a correct S/N over the network.

The correct tool for this is /usr/bin/dig (or nslookup if that's all you
have, as is true by default for Windows users -- but nslookup is buggy
and deprecated).  'dig' queries the public DNS.

Editing a zonefile and staring at it is _not_ querying the public DNS.
It is notoriously common for people to edit such a zonefile and, e.g.,
fail to reload it.  The only _relevant test_ is to query it exactly like
a public user of the DNS -- e.g., using dig.
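
As a rough sketch of that test, the Python below shells out to dig and
pulls the S/N out of the SOA record.  It assumes dig is installed and
on the PATH; the helper names are mine, and the zone and server you
pass in (e.g. example.com and ns1.example.com) are placeholders for
your own.

```python
import subprocess


def parse_soa_serial(dig_short_output: str) -> int:
    """Extract the serial from `dig +short SOA` output.

    The SOA fields are: MNAME RNAME SERIAL REFRESH RETRY EXPIRE MINIMUM.
    """
    fields = dig_short_output.split()
    if len(fields) < 7:
        # Empty or truncated output means the server gave no usable SOA.
        raise ValueError("no SOA record in dig output")
    return int(fields[2])


def served_serial(zone: str, server: str) -> int:
    """Ask `server` over the network which serial it serves for `zone`."""
    result = subprocess.run(
        ["dig", "@" + server, zone, "SOA", "+short"],
        capture_output=True, text=True, check=True,
    )
    return parse_soa_serial(result.stdout)
```

Calling served_serial() against the primary, and then against each
secondary, shows what the world actually sees, which is the only thing
that counts.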

This guy didn't, so he was completely unaware that he'd totally failed
to address his problem.

It also turns out, this guy's idea of how to update a zone's record was
to edit it and then _restart the nameserver software_ (BIND9).  I
pointed out that restarting BIND9 just to reload a single zone is like
rebuilding an automobile engine just to change the oil.  I mean, it
works, but it's awesomely slow and inefficient.  Turns out, this guy had
never heard of 'rndc', the BIND9 tool that can (among other things)
signal the BIND9 daemon to reload into memory a zonefile revised
on-disk.
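
For the record, the single-zone reload is 'rndc reload' followed by the
zone name.  A minimal Python wrapper might look like the sketch below;
it assumes rndc is installed and already configured to talk to the
local daemon, and the function name and the injectable 'run' parameter
are my own, there only to make the sketch easy to test.

```python
import subprocess


def reload_zone(zone: str, run=subprocess.run) -> list:
    """Ask the running BIND9 daemon to reload one zone from disk."""
    command = ["rndc", "reload", zone]
    # check=True raises CalledProcessError if rndc exits non-zero.
    run(command, check=True)
    return command
```

A clean exit from rndc still proves nothing by itself; the S/N has to be
queried over the network afterwards.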

So, he didn't know the basics of the tools that underlie his entire
Internet presence.

But the far worse problem is that he had a screwball notion of what
determines whether a task is completed.

And anyone can get _that_ part right.




