[conspire] DNS & nameservers ...

Michael Paoli Michael.Paoli at cal.berkeley.edu
Tue Mar 20 03:27:04 PDT 2018


> Date: Wed, 5 Apr 2017 22:18:00 -0700
> From: Rick Moen <rick at linuxmafia.com>
> To: conspire at linuxmafia.com
> Subject: Re: [conspire] Unbound + dnsmasqd on openwrt
> Message-ID: <20170406051800.GT6577 at linuxmafia.com>
> Content-Type: text/plain; charset=utf-8

> Each DNS data item (called an RR = Resource Record) travels the Internet
> with an associated expiry time (a number of seconds after which it is
> to be regarded as stale and not used).  This is called the TTL = Time to
> Live value.  TTLs of 86400 seconds (1 day) or more are quite common.
> Ultra-long TTLs would make your published data highly persistent in
> other recursive servers' caches, but mean any changes to your records
> might take longer getting out to the public (because distant recursive
> servers already having the old records in cache don't know they're
> obsolete, only that their TTL hasn't expired).  Ultra-short TTLs permit
> rapid propagation of changes to your DNS records, but limit the lifetime
> (hence benefit) of third-party nameserver caching.

Well ... sort of kind of almost.  ;-)  TTL gives the *maximum* time (in
seconds) the RR may be cached (or, more specifically, is to be
considered valid).  Nameservers *may* discard the data sooner ... and
many do, e.g. due to memory constraints/pressure, or configuration -
such as capping cached entries to some particular maximum TTL - or they
may combine that with Least Recently Used (LRU) algorithms: favoring
longer retention (up to the TTL) for records that are queried frequently
or semi-frequently, and disfavoring records that just aren't getting
repeated queries on any frequent basis.  Of course, where memory is
plentiful, there's often much less (to about zero) incentive to drop
things early from cache.
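That remaining-lifetime behavior is easy to observe.  A minimal sketch
(the resolver address and query name below are placeholders, not any
particular real setup): ask a recursive resolver for the same record
twice and watch the TTL column shrink.

```shell
# Hypothetical illustration -- resolver IP and name are placeholders.
# A cache serves the *remaining* lifetime, never more than the
# authoritative TTL, and possibly less if it evicted the entry early.
resolver=192.0.2.53
name=www.example.com.

ttl1=$(dig @"$resolver" +time=2 +tries=1 +noall +answer "$name" A |
    awk '{print $2; exit}')   # field 2 of an answer line is the TTL
sleep 10
ttl2=$(dig @"$resolver" +time=2 +tries=1 +noall +answer "$name" A |
    awk '{print $2; exit}')
echo "first TTL: $ttl1, ten seconds later: $ttl2"
```

On a cache hit, the second value is at most the first minus the elapsed
time; a value that jumps back up means the resolver refetched (or
evicted and refetched) the record in between.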

> Anyway, my master nameserver for my domains lives on the ratty old
> Pentium III in my garage, on slow aDSL to the Internet.  Sometimes the
> aDSL suffers an outage, usually because AT&T, the local large carrier,
> does something stupid that takes my ISP, Raw Bandwidth Communications,
> offline.  But I don't worry about my DNS continuity of service, because
> my master nameserver is, from a public perspective, just one of five,
> and also most queries are satisfied out of some other recursive
> nameserver's cache, anyway.

Redundancy is a good thing.  :-)  *Reasonably* managed and maintained,
DNS is pretty dang robust and redundant.  But that's not to imply
DNS nameserver software or its data is immune from administrative /
management booboos or similar errors/flakiness.

> Some years ago, I got a lesson in why it's important to not merely have
> the recommended 3-7 authoritative servers (for a domain) but to
> periodically reverify that they still exist and are still serving up the
> domain:  I had been helping a somewhat haplessly half-assed LUG effort
> in Santa Cruz, CA called 'Smaug' (notionally the Santa Cruz Microsoft
> Alternative User Group).  ns1.linuxmafia.com was one of the domain's
> (scruz.org's) _six_ nameservers.  The other five were the personally
> operated nameservers of five other Santa Cruz locals.  One of them,
> mine, was treated as the master, which merely means that the zone
> contents are maintained there and the other five periodically refresh
> the zone contents from it.  Years passed.
>
> One day, my aDSL line had been down for a few hours because of a
> particularly epic AT&T screwup.  I got back online, and Smaug's mailing
> list (hosted at svlug.org) was full of complaints that scruz.org
> had been out of service, as in unresolvable (except where cached).
> Some of these complaints, to my particular irritation when I solved the
> problem, were from the other five individuals supposedly providing the
> scruz.org authoritative service.
>
> I got out 'dig' and checked out the other five nameservers.  Some of
> them no longer existed.  (In one case, the owner had moved it to a new
> IP address and never informed me as maintainer of the master
> nameserver.)  Another had just been taken out of service entirely without
> notice.  And the other three had been manually modified to no longer
> serve up scruz.org as authoritative data, but were still serving other
> zones -- and, again, their operators had said nothing to me as operator
> of the master zone.
>
> So, over a couple of years, all five of these guys had silently reduced
> scruz.org's service redundancy from the official six nameservers of
> record down to one (mine) -- but simply forgot to mention this fact.
> And a bunch of them were now scolding _me_ -- the only guy who hadn't
> shot the domain in the foot -- for unreliable nameservice.
>
> The lesson was:  Never assume people will do The Right Thing when the
> path of least resistance lies otherwise.  Monitor, monitor, monitor.

Yes, very true ... good to have appropriate monitoring in place ... or
if not (quite) that, at least some periodic checking - and preferably
automated, lest it be forgotten or dropped in complacency/laziness.
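A minimal sketch of such a check, in the spirit of the story above (the
domain below is a placeholder; adjust and run from cron): query every
listed nameserver directly and complain about any that no longer answers
authoritatively - i.e. has gone "lame" the way those scruz.org servers
silently did.

```shell
#!/bin/sh
# Hedged sketch -- example.org is a placeholder domain.  A server that
# answers without the "aa" (authoritative answer) flag set, or doesn't
# answer at all, has silently stopped serving the zone: exactly the
# failure mode described above.
d=example.org.
for ns in $(dig +short "$d" NS); do
    if dig @"$ns" +norecurse +time=3 +tries=1 "$d" SOA 2>/dev/null |
            grep -q 'flags:.* aa[ ;]'; then
        : # still answering authoritatively -- fine, stay silent
    else
        echo "WARNING: $ns is not (or no longer) authoritative for $d"
    fi
done
```

Run weekly from cron with output mailed to yourself; silence means all
is (probably) well.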

> Lesson having been learned, I fixed the domain's redundancy _and_ wrote
> up a quick-and-dirty weekly cronjob to report specifically to me about

Alas, I haven't (yet) dropped in such automated monitoring 8-O ... but I'm
mucking about in DNS semi-regularly enough that I typically end up
checking it a few times per month anyway, and certainly at bare minimum
multiple times per quarter ... so that's at least partial coverage.
(Yes, adding better and more complete monitoring, and other automation,
is still on my long todo list.  Also of note, many monitoring software
packages have excellent capabilities to monitor DNS ... but that may be
overkill, depending upon one's objectives.)

Let's see if I have a semi-random example still in my shell history ...
Yep ... here's the example and its output - not a full check, but ...:
$ (d=balug.org.; for ns in $(dig +short "$d" NS); do
    for nsip in $(dig +short "$ns" A "$ns" AAAA); do
      dig @"$nsip" +noall +norecurse +answer +nosplit +nomultiline \
        "$d" SOA | sed -e 's/$/ ['"$nsip $ns"']/'
    done
  done)
balug.org.              86400   IN      SOA     ns1.balug.org. hostmaster.balug.org. 1521302930 9000 1800 1814400 86400 [198.144.195.186 ns1.linuxmafia.com.]
balug.org.              86400   IN      SOA     ns1.balug.org. hostmaster.balug.org. 1521302930 9000 1800 1814400 86400 [64.62.190.98 ns1.svlug.org.]
balug.org.              86400   IN      SOA     ns1.balug.org. hostmaster.balug.org. 1521302930 9000 1800 1814400 86400 [2600:3c01::f03c:91ff:fe96:e78e ns1.svlug.org.]
balug.org.              86400   IN      SOA     ns1.balug.org. hostmaster.balug.org. 1521302930 9000 1800 1814400 86400 [198.144.194.238 ns1.balug.org.]
balug.org.              86400   IN      SOA     ns1.balug.org. hostmaster.balug.org. 1521302930 9000 1800 1814400 86400 [2001:470:1f04:19e::2 ns1.balug.org.]
$
Basically that queries each A and AAAA address of all the NS servers for
balug.org, gets their SOA RR with a *non*-recursive query, displays it
in a one-line format, and tags each line with the IP and NS name.
Again, not a *full* check, but an example of a pretty good check one can
do.  In this particular case I was updating zones (for letsencrypt.org
validation by DNS(+DNSSEC)), and wanted to ensure that all the
authoritative nameservers had picked up the new data ... an SOA check
alone didn't *fully* test that, but was a "good enough" test: short of
other major issues, a nameserver having picked up the new SOA (notably
its serial number) has likely picked up the entire zone just fine.
Anyway, folks/scripts/programs will often do similar checks ... e.g.
see that all the nameservers are responding, see that the SOA serial is
either current or at least doesn't lag too far behind, check that they
respond properly over TCP, and often check many critical/important DNS
RRs to ensure they don't go missing or deviate from expected values,
etc.  So, e.g., sometimes I've dropped slave nameservers for chronic
misbehavior/flakiness (e.g. one I recall always advertised IPv6, but
*never* responded on IPv6).
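The serial-lag check mentioned above can be sketched roughly like this
(the domain is a placeholder, and the comparison is plain numeric,
ignoring RFC 1982 serial-number wraparound for brevity):

```shell
#!/bin/sh
# Hedged sketch: collect the SOA serial from every listed nameserver for
# a (placeholder) domain and flag any server that is dead or lagging
# behind the newest serial seen across the set.
d=example.org.
tmp=$(mktemp)
for ns in $(dig +short "$d" NS); do
    # +short SOA output: mname rname serial refresh retry expire minimum
    serial=$(dig @"$ns" +short +norecurse +time=3 +tries=1 "$d" SOA 2>/dev/null |
        awk '{print $3}')
    echo "$ns ${serial:-NONE}" >>"$tmp"
done
newest=$(awk '$2 != "NONE" && $2+0 > m {m = $2+0} END {print m}' "$tmp")
awk -v n="$newest" '
    $2 == "NONE"             {print "DEAD:",    $1}
    $2 != "NONE" && $2+0 < n {print "LAGGING:", $1, $2, "newest is", n}' "$tmp"
rm -f "$tmp"
```

A stale serial usually means the slave is failing its zone transfers or
missing NOTIFYs - worth investigating before it expires the zone.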

Oh, ... also good to note and save contact information for DNS nameserver
operators, preferably including multiple means, e.g. email, phone,
alternate contact means/numbers/addresses, etc. - and store them
somewhere they can be found years later ... like as comments in zone
master files (or, for slaves, as comments in the relevant configuration
files).  *Otherwise*, 2/3/5/8 years later, when an issue or question or
need to notify comes up ... gee, who's the contact and *where* was that
noted those several or more years ago?  ... which reminds me of a good
general rule: store it where it can be logically found - even years
later.  Some of us remember drives, and terminating resistor packs.
Sometimes you needed 'em (end of chain), sometimes not (not end of
chain) ... but that situation could also flip, years later ... oh, ...
where's that resistor pack that came with the drive that was installed
4 years ago?  Uhm ... yeah.  Tape it to the drive, as close as feasible
to where the resistor pack plugs in, and use an appropriate type of
tape, so it won't bleed or fall off in the intervening time (learned
that many decades ago, from folks already much more experienced at it
than I was at the time).  Same general practice when disassembling
things - if one might ever want to reassemble them - keep the relevant
parts/components, etc. together.  So, ... in general, put it where one
would logically go look for it - even if it's years later, even if
someone else needs to do the looking, even if you can't be bothered to
tell them where to look or don't remember where it was put.

> Many people doing authoritative DNS are blissfully unaware of doing it
> badly because it's highly robust and forgiving of most ineptitude.
> E.g., I tried to help one fellow (SVLUG's VP at the time) by telling him
> his nameservice was dangerously thin because he had only two
> authoritative nameservers listed for the domain, and one of them,
> queried using dig, turned out to be malfunctioning.  He said 'What's the
> big deal?  My domain still works.'  He'd evidently been putting up with
> that and similar situations for years.  From his perspective, his
> Internet presence hadn't failed catastrophically yet, so, so far, so
> good.
>
> RFC 1912 recommends minimum three, maximum seven.  Two is the
> minimum _permitted_ (RFC 1912 section 2.8:  'You are required to have at
> least two nameservers for every domain, though more is preferred'), and
> domain owners should always set three as the smallest amount of
> redundancy to settle for.  I personally go with five as a good middle
> value.

Well, nowadays, to *some* extent, ... that kind'a depends.  There are
various ways to do, e.g., high availability with network stuff and
routing, even with a single IP (e.g. anycast).  But yeah, DNS, ...
don't do *just* a single IP for authoritative nameservice, ... because,
well, Mr. Murphy and his laws are alive and well and often quite
practiced.  A highly redundant single IP won't do diddly for ya' when
that one IP still fails anyway.  So yes, minimum two.  Beyond that ...
it really depends how important it is and what your risk model is ...
usually three (or more) is good.  "Too many" can also be problematic.
And three (to five or a bit more) that are rock solid (or nearly so) is
often significantly better than five or six or seven or more with
several of 'em being sort'a kind'a mostly available except when, too
frequently, they're not.  And why might that be?  Yes, sure, caching
helps, ... but it's never 100% coverage.  And what happens past that
depends what's available at the time.  Nameservers that don't respond
add to latencies: clients query, wait a bit, time out, then try another
nameserver ... the cumulative effect can sometimes be bits of
intermittent flakiness in observed DNS resolution.  Notably, too many
combined failure --> failover latencies can add up, and depending upon
the queries, this can result in recursive nameservers ending up needing
to return a SERVFAIL and/or only partial results ... sometimes this
clears in fairly quick order with later repeated queries, or it may come
and go for some bit.  Bandwidth and latency can also cause or exacerbate
DNS problems.  E.g. DNS server at the end of a "slow" DSL link?
*Most* of the time not an issue, ... but when bandwidth is saturated and
latencies are quite high (e.g. many hundreds to a few thousand
milliseconds), then various DNS queries can start to fail - at least
intermittently - due to timeouts.
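That failure --> failover latency is easy to measure directly.  A hedged
sketch (placeholder domain) using dig's own query-time report, with a
short timeout and a single try so a dead server shows up as such instead
of being silently retried:

```shell
#!/bin/sh
# Hedged sketch: time each authoritative nameserver for a (placeholder)
# domain.  A server that blows the 2-second budget is roughly what a
# resolver experiences as failover latency before trying the next one.
d=example.org.
for ns in $(dig +short "$d" NS); do
    ms=$(dig @"$ns" +norecurse +time=2 +tries=1 "$d" SOA 2>/dev/null |
        awk '/^;; Query time:/ {print $4}')   # ";; Query time: N msec"
    if [ -z "$ms" ]; then
        echo "$ns: no answer within 2s (adds failover delay for clients)"
    else
        echo "$ns: ${ms} ms"
    fi
done
```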

> I mentioned my peeve about most DNS admins' overuse of the CNAME record
> type.  I should perhaps explain what that's about.  A CNAME is an alias

Yes, not only CNAME records (which certainly also have their place), but
there are often various optimizations that can be done: not only TTLs,
but also which related records are served from the same zone and same
authoritative nameservers.  E.g., a nameserver may commonly respond with
not only answers, but also an additional section, often precisely
anticipating what the querying client would next need/want to know
anyway, and even fitting all that within a single UDP packet response.
So ... it can almost be an art to get it highly and fully optimized.
*Most* of the time it's not *that* critical ... but at least sometimes
it can make or contribute to significant to major performance
enhancements - depending upon usage scenario and application and other
factors.  (Maybe O'Reilly ought to have yet another DNS book, "The Art
of DNS Performance Optimization" or something like that ... could make
for a fair-sized book in and of itself.)  Of course some areas of DNS
are pretty specialized ... e.g. very high volume DNS servers ... such
as root DNS servers, or major TLD (e.g. .com, many country code TLDs,
etc.) DNS servers - notably due to their relatively extreme traffic
volumes and major security considerations (very attractive targets ...
so if there's a major zero-day exploit, where is it likely to first be
aimed?  ... yeah, those would be high on the list of targets of most
attackers, not to mention DDoS attacks, etc. ... similar also typically
applies to many DNS service providers).
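The additional-section behavior is easy to observe with dig.  A hedged
illustration with placeholder names: an authoritative server asked for a
zone's MX may volunteer the mail host's A/AAAA records in the ADDITIONAL
section (when that host is in a zone it also serves), sparing the client
a follow-up round trip.

```shell
# Placeholder server and zone -- substitute a real authoritative server
# for a zone you control.
dig @ns1.example.org. +norecurse example.org. MX |
    grep -E 'ADDITIONAL|MSG SIZE'
# In full dig output, the relevant bits look something like:
#   ;; ADDITIONAL SECTION:
#   mail.example.org.  86400  IN  A  192.0.2.25
#   ;; MSG SIZE  rcvd: 93
# Keeping answer + additional within a single UDP datagram (classically
# 512 bytes, larger with EDNS0) is part of the optimization art.
```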




