[conspire] The first rule ... & DNS & SOA serial numbers (everything(?) you [n]ever wanted to know about DNS SOA serial numbers, but were afraid to ask ; -))

Mon Mar 7 11:52:52 PST 2016

Quoting Michael Paoli (Michael.Paoli at cal.berkeley.edu):

> Yes, excellent point, need to *check*, and *verify*.

Yes, thank you:  That was my overriding point.  If the task is '[foo]
must happen', you're not done until you've shown [foo] happening.
Therefore, if the task is 'All my authoritative nameservers must respond
with this new value for a record', the necessary and logical last step
is to query those nameservers, and make sure your change actually went
out.
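
A minimal sketch of that last step, assuming dig is installed.  The zone
name, the nameserver names, and the function names are placeholders of
mine, not anything from this thread.  It asks each authoritative server
directly for the SOA record and compares the serials that come back:

```shell
#!/bin/sh
# Sketch only: zone, server, and function names are hypothetical.

# Print the serial (third field) from one line of 'dig +short SOA' output,
# which has the form: mname rname serial refresh retry expire minimum
soa_serial() {
    awk 'NF >= 7 { print $3; exit }'
}

# Read "server serial" lines on stdin.  Succeed only if there is at least
# one line, every serial is numeric, and all serials are identical.
serials_agree() {
    awk '
        $2 !~ /^[0-9]+$/ { bad = 1 }
        { seen[$2] = 1; n++ }
        END {
            c = 0
            for (s in seen) c++
            exit !(n > 0 && c == 1 && !bad)
        }'
}

# Query each listed nameserver directly (no recursion, so no caches are
# involved) and print "server serial" for each one.
check_zone() {
    zone=$1; shift
    for ns in "$@"; do
        serial=$(dig +short +norecurse SOA "$zone" @"$ns" | soa_serial)
        echo "$ns ${serial:-NO-ANSWER}"
    done
}

# Usage (hypothetical names):
#   check_zone example.com ns1.example.com ns2.example.net | serials_agree \
#       && echo "all servers agree"
```

The point of splitting it this way is that the comparison is done on what
the servers actually said, not on what the zonefile says they should say.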

The Linux person in question (**COUGH** Ruben Safir **COUGH**) muffed
the S/N on a minor change.  When I as his DNS secondary informed him of
his S/N goof (not a very impressive turn of events, but stuff happens),
caught for me by logcheck, he wrote back shortly afterwards saying he'd
'fixed' it.  A brief check using dig showed that nothing had been fixed
_in real-world DNS results_.  Whatever it was that he 'fixed', it didn't
work -- which is also arguably in the 'stuff happens' category.  

But what _was_ notable, and not a 'stuff happens' item, is that he
(obviously) didn't check actual DNS results.  And that is the First Rule:
If the task is that [foo] DNS should be published, then you're not done
until you've shown [foo] item being published.

This troubleshooting dictum applies strongly to every technical task,
not just DNS -- and it's a basic precept of test-driven development.
https://en.wikipedia.org/wiki/Test-driven_development

And it just means being results-oriented, frankly.  If Deirdre sends me
to the store for a bottle of milk, and I come back saying I'd seen and 
evaluated several bottles and several cartons of milk, but for various
reasons didn't buy one -- though I'd also observed a cow in a field on
the way home -- Deirdre would be justified in writing down on the Big
Whiteboard of Household Tasks:

Task:  Acquire bottle of milk:  FAILED

Then we refactor Rick and repeat.

> That means, e.g., using dig(1).  Just because one thinks the data is
> or looks right in the zone file doesn't mean it's being served up, or
> served up as expected.

Yes, and everyone who's done primary nameservice for more than about a
day learns this.  However, my point was that if you are mindful of the
First Rule, you don't even need to know that.  You'll hear 'I checked
the zonefile in vi', and immediately smell rodent.  You'll say 'Yes,
that's all very well, but looking at the zonefile is not checking DNS.
In order to check DNS, you have to, y'know, do DNS.'

> E.g. most DNS server software, when given a misconfigured
> zone file, will generally continue to serve the data they got from the
> zone file last time they successfully loaded it, and will ignore a bad
> zone file (and will typically complain about it to logging facility or
> log file).  If the DNS admin isn't paying attention, they could miss
> the fact that the updated zone file wasn't loaded at all due to some
> error it contains.

This is why it's useful for the task to include running
# tail -f /var/log/daemon.log | grep "$ZONENAME"
Because since the task involves 'Nameserver daemon must load the zone',
you're not done until you've shown that it loads the zone.

Something you may not know:  The linting routines missing from BIND9
were supplied in /usr/sbin/named-checkconf , which I highly recommend.

At my former firm of many years, where I administered DNS for hundreds
of domains, I developed a specific invocation of named-checkconf that
everyone referred to as the Rick Test, that became an obligatory part of
all DNS change control procedures.  It was a little like:

# named-checkconf -t /var/named/chroot -z | grep -v 'loaded serial' 

...except that I had a much longer grep -v list, after (if memory
serves) looking through the named-checkconf source code and making a
list of every possible diagnostic that we really just didn't care about.

named-checkconf with the -z option first checks the syntactic validity
of named.conf and every file it includes, then (the -z bit) test-loads
every master zone that named.conf references.

After adopting the Rick Test, we never had any instance of a zonefile
refusing to load.
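
Here is a rough sketch of the shape of such a test.  It is not the
original from work: the function names are mine, the pattern list has
only the one entry quoted above, and the chroot path is just the default
from the example.  The idea is that anything surviving the filter is
something a human has to look at:

```shell
#!/bin/sh
# Sketch only: function names and the (short) pattern list are illustrative.

# Extended regexes for routine output we do not care about, one per line.
# Add further patterns on their own lines; never leave a blank line here,
# because an empty pattern would match (and hide) everything.
ROUTINE_PATTERNS='loaded serial [0-9]+'

# Drop routine lines from stdin; print whatever is left.
filter_routine() {
    grep -E -v -e "$ROUTINE_PATTERNS"
}

# Run the check and fail if any non-routine output remains.
rick_test() {
    out=$(named-checkconf -t "${1:-/var/named/chroot}" -z 2>&1 | filter_routine)
    if [ -n "$out" ]; then
        printf '%s\n' "$out" >&2
        return 1
    fi
}

# Usage:
#   rick_test && echo "safe to reload"
```

Judging by leftover output rather than exit status is deliberate in this
sketch, since the pipeline hides named-checkconf's own exit code.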

> nslookup *was* deprecated.  It *was* going to go away (and I say good
> riddance - dig is so much better/nicer - though it does take a wee bit
> of getting used to when first switching from nslookup).  But sometime
> subsequent to that, nslookup got a reprieve (darn), and was no longer
> slated to go bye-bye forever.  Not sure if it's changed again since
> then, but that's last I recall reading on the matter some moderate
> number of years ago.  (Likely there's answer on Wikipedia ... let's see
> ...) Yup, ... *was* deprecated ... then changed to not deprecated in
> 2004
> https://en.wikipedia.org/wiki/Nslookup
> https://lists.isc.org/pipermail/bind-announce/2004-September/000155.html

Yes, they said that as of BIND 9.3, but they never clarified why.  I've
always been more than a bit suspicious.  Did they rewrite the underlying
code so it no longer gives the provably erroneous results chronicled by
Jonathan de Boyne Pollard at
http://homepage.ntlworld.com/jonathan.deboynepollard/FGA/nslookup-flaws.html ?
Much of that badness resulted from nslookup's recycling of old and
extremely buggy BIND8 code.  Did they junk that code and rewrite the
tool?

I could investigate this, but have no incentive because I moved on to
'dig' around 2001, and regard nslookup as a terrible, badly designed
tool, ergo I really no longer care if it's been fixed.

And 'fixed' in this case is a word whose definition is a bargaining
point (as to the scope of what needs fixing).  I happen to agree with de
Boyne Pollard that it's absurd for nslookup to rely on its own internal
DNS client instead of using the same system-wide DNS client libraries
everything else uses.  Therefore, if ISC 'fixed' nslookup by shimming in
client software from BIND9 and ripping out the ancient client software
from BIND8, my reaction would be that a fix that doesn't address
grievous design flaws is not a fix.

Anyway, you're more than welcome to investigate the situation, and tell
me what you find.  I continue to say it's deprecated.  By _me_, at
minimum.  Because it's terrible in several individual and compelling
ways.

> Whatever service manager and/or init system one has on one's operating
> system, usually also includes a reload capability - so most of the time
> one needn't even need to know the capabilities of rndc (but for large
> DNS site administrators, recommended to reasonably well learn rndc - it
> has, for example, capabilities to only reload one single zone - thus
> skipping the rereading of other zone files - which can be a huge factor
> for large DNS sites).

True, but knowing to use rndc is a basic skill for BIND administrators 
-- and the init script's reload stanza is, IIRC, just a thin wrapper
around rndc (and a blunt-tool one at that, instructing rndc to reload
_all_ zones -- needlessly in the usual case where you've changed just
one zone).
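
For the one-zone case, something like the following is what I have in
mind.  It is a sketch: the function name, zone name, and file path are
placeholders, and the RNDC and CHECKZONE variables exist only so the
commands can be swapped out for a dry run.  It lints the one zonefile
with named-checkzone and, only if that passes, tells named to reload
just that zone:

```shell
#!/bin/sh
# Sketch only: zone name, file path, and function name are hypothetical.
# RNDC and CHECKZONE may be overridden, e.g. RNDC=echo for a dry run.
RNDC=${RNDC:-rndc}
CHECKZONE=${CHECKZONE:-named-checkzone}

# Lint one zonefile; if it passes, reload that zone and no other.
reload_one_zone() {
    zone=$1
    file=$2
    "$CHECKZONE" "$zone" "$file" || return 1
    "$RNDC" reload "$zone"
}

# Usage:
#   reload_one_zone example.com /etc/bind/db.example.com
```

And then, per the First Rule, you still go and query the nameservers
with dig afterwards.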

> Some of the very good reasons to generally *not* restart a DNS server:
> Most notably, if one mucks up a zone file, in many cases, the DNS
> server will (if it's data/config error that's invalid) typically not
> load the zone file, and will log the error (or at least failure to load
> the zone), and will continue to serve the older valid zone data it
> loaded.

Note that use of named-checkconf -z averts this scenario entirely.

> More bits that can and do go wrong.  And even with - or quite closely
> related to SOA serial numbers.  Notably timing.  Let's say one has been
> using a certain scheme for serial numbers, and one wants to change it.

Some people prefer using seconds since the Unix epoch (which you like),
or some other code-friendly scheme.  As you know, I don't.

I'm sometimes asked:  Suppose you have an actual need for more than 100
zonefile changes per day?  Then YYYYMMDDnn cannot work, can it?

My answer is that, if you're doing _that_ much zone updating, you
probably should be using PowerDNS Authoritative Server rather than
BIND9, NSD, or any of the other authoritative nameserver packages that
rely on RFC 1035 ("BIND") zonefiles as back-end storage.

PowerDNS Authoritative Server ('pdns') applies a radical approach to the
problem: Its back-end is a SQL database (default MySQL), all DNS RR
updates are conducted atomically as a per-record change, and AXFR/IXFR
zone transfers are eschewed entirely:  Instead, pdns merely relies on
SQL replication between pdns nodes.  Hence, there are no zone transfers,
and also the S/N subfield has no use and is disregarded.

> Or, other case, a booboo was made with serial numbers.  

Yeah, about that:  At work, we relied heavily on a locally hacked
'serial.sh', built from standard shell tools (mostly grep, awk, and
sed), to set appropriate S/N values.  The workflow went:

o  Check the head version of all the zone files out of version control.
o  Make whatever changes are required.
o  Run serial.sh on all changed zone files in your working directory.
o  Check changes into version control.

I've just found a reasonable replacement (as I didn't keep a copy of the
one from work), and have put it here:

http://linuxmafia.com/pub/linux/network/serial
...and also in /usr/local/bin.

Use a tool like that _instead_ of manually frobbing a zone's S/N date in
a text editor, and you will never have erroneous S/Ns in a zonefile
again.
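
To show the arithmetic such a tool has to get right, here is a minimal
sketch.  It is not the script at the URL above and not the one from
work; the function name and its behaviour on edge cases are my own
assumptions.  Given the current S/N and today's date, it picks the next
YYYYMMDDnn value and refuses to go backwards:

```shell
#!/bin/sh
# Sketch only: not the 'serial' script mentioned above; names are mine.

# Print the next YYYYMMDDnn serial, given the current serial ($1) and
# optionally today's date as YYYYMMDD ($2, default: the system date).
# Fails if the current serial is already at or past today's nn=99.
next_serial() {
    cur=$1
    today=${2:-$(date +%Y%m%d)}
    base=$((today * 100))
    if [ "$cur" -lt "$base" ]; then
        # First change today: start at nn=00.
        echo "$base"
    elif [ "$cur" -lt $((base + 99)) ]; then
        # Already changed today: bump nn by one.
        echo $((cur + 1))
    else
        echo "next_serial: cannot advance past $cur on $today" >&2
        return 1
    fi
}

# Usage:
#   next_serial 2016030601 20160307    # first change of the day
#   next_serial 2016030700 20160307    # second change of the day
```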

> We also have to
> potentially worry about unlisted slaves and the like ... e.g. if we
> allow more than just slaves to do AXFR on our zone.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Well, Don't Do That, Then.  allow-transfer ACLs are your friend.

You know, Michael, one thing continually amazes me.  It's not difficult
to teach people how to do DNS right -- but one seems to spend all of
one's time trying to un-teach people who've been doing parts of it
wrong, and it's just amazing how many creative ways people find to mess
up.

> But where there was serial number booboo, or one wants to change
> scheme, where going from old existing, to desired new, the new is not
> "greater" than the old, then one needs to proceed carefully and in
> appropriate manner to ensure all gets properly updated.

This, actually, is a standard problem with standard (if painful and
tedious) solutions, which can be found in the usual places (serverfault,
etc.) if one ever needs them.   
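
For the curious, the usual recipe rests on RFC 1982 serial arithmetic:
a secondary treats a 32-bit S/N as newer if it is ahead by less than
2^31, modulo 2^32.  So you jump forward by at most 2^31 - 1, wait until
every secondary has picked that up, and then set the value you actually
wanted, which by then counts as newer.  A sketch of the arithmetic,
assuming a shell with 64-bit integers (function names are mine):

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.  Arithmetic per RFC 1982
# for 32-bit serials; assumes the shell's integers are 64 bits wide.

# Succeed if serial $1 counts as newer than serial $2.
serial_newer() {
    d=$(( ($1 - $2) & 4294967295 ))
    [ "$d" -ne 0 ] && [ "$d" -lt 2147483648 ]
}

# Print the largest single forward jump from $1: add 2^31 - 1, mod 2^32.
serial_max_step() {
    echo $(( ($1 + 2147483647) & 4294967295 ))
}

# Usage (moving from 2016030701 down to 1457380000):
#   serial_newer 1457380000 2016030701 || echo "needs an intermediate step"
#   serial_max_step 2016030701         # publish this, wait for secondaries
```

If the target still isn't newer than the intermediate value, you repeat
the jump.  Either way, you verify each stage on every secondary with dig
before moving on.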

> MINIMUM / Negative Cache TTL - that's not really a "minimum" per se.

Since 1998, when RFC 2308 changed the subfield's meaning, I've referred
to this simply as 'negative TTL'.  Much less confusing -- and less likely
to be misread by old-timers still thinking of the old meaning of
'default TTL' the subfield used to have.

> Typos happen.  Sometimes in SOA serial numbers.  8-O

But never if you use /usr/local/bin/serial .