[sf-lug] Fixed: Re: DNS: sf-lug.com. "down": NS 208.96.15.252 "broken"

Sat Feb 28 14:05:09 PST 2009

Quoting Michael Paoli (Michael.Paoli at cal.berkeley.edu):

> Fixed, details towards the tail end of:
> http://www.sf-lug.com/log.txt

Short version:  The instance of BIND9 that served DNS nameservice as
ns1.sf-lug.com was not set to launch at boot time.  Michael fixed that
error, and also started a runtime BIND9 instance.  And then, the most
important thing:

> $ dig @208.96.15.252 -t A sf-lug.com. +short
> 208.96.15.252
> $ dig @208.96.15.252 -t A sf-lug.com. +short +tcp
> 208.96.15.252
> $

Something I have to keep teaching people working for and with me (but
not, of course, Michael):  The only relevant test of a hardware or
software system is whether it carries out the function(s) for which it
exists.  BIND9 on 208.96.15.252 is there to (1) answer queries, and (2) 
provide zone transfers to its secondaries (slaves).

Michael's test, above, verifies that the machine is answering queries
_both_ using the default UDP transport and using TCP.  As mentioned
elsewhere, most DNS is over UDP packets, but TCP is required for (a)
zone transfers and (b) any responses whose total length exceeds 512
bytes, which some do.
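
For anyone wanting to repeat Michael's two checks against other servers,
here is a minimal sketch, assuming a POSIX shell with dig on the PATH.
The function name check_dns_transports is made up for illustration; it
is not anything from Michael's log.

```shell
#!/bin/sh
# check_dns_transports SERVER NAME   (hypothetical helper name)
# Ask SERVER for the A record of NAME twice: once over UDP (dig's
# default) and once over TCP (+tcp).  Returns 0 only if both transports
# produce a non-empty answer.
check_dns_transports() {
    server=$1 name=$2 rc=0
    for opt in "" "+tcp"; do
        # $opt is deliberately unquoted, so the empty case adds no argument.
        # sed drops dig's ";;" diagnostic lines (timeouts and the like).
        answer=$(dig "@$server" -t A "$name" +short $opt | sed '/^;/d')
        if [ -z "$answer" ]; then
            echo "FAIL: $server gave no answer for $name (${opt:-udp})" >&2
            rc=1
        fi
    done
    return $rc
}
```

Calling it as "check_dns_transports 208.96.15.252 sf-lug.com." exercises
the same two queries quoted above.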

This fact is significant to note, in part, because it's _very_ common
for sysadmins unaware of that wrinkle to commit the bonehead error of
firewalling off port 53/tcp, thinking "All I have to do is allow 53/udp,
to let my nameserver do its job."  Wrong.  Doing that permits almost all
queries to go through, but with a few mysterious failures, and makes all
zone transfers between master and slave nameservers fail.
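
As a rough illustration of what the firewall needs to allow, here is a
sketch that merely prints the two inbound rules rather than applying
them, so that someone with a clue can review them first.  The INPUT
chain, the use of -A, and the function name are my assumptions; a real
ruleset will differ in ordering and in what else it matches on.

```shell
#!/bin/sh
# dns_inbound_rules   (hypothetical helper name)
# Print, without applying, the inbound iptables rules a nameserver needs:
# port 53 over BOTH udp and tcp.  Chain name and rule placement are
# assumptions for this sketch.
dns_inbound_rules() {
    for proto in udp tcp; do
        echo "iptables -A INPUT -p $proto --dport 53 -j ACCEPT"
    done
}
```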

The other common bonehead error for firewalling DNS goes as follows: 
Someone's looking through BIND9 docs (usually for a nameserver in a
network DMZ zone or behind the corporate firewall), and sees a
"query-source" directive that can be used in the conffile to "lock" BIND9
into originating all queries from port 53 (the dedicated TCP/IP port for
DNS queries).  This admin then thinks, "Oh, good,
that'll simplify my firewalling:  I can block all outgoing traffic 
except from identified services such as port 53 for DNS."  The admin
writes an iptables rule, verifies that DNS seems to be still working,
and calls it good.

_Really_ bad idea.  Modern nameservers _deliberately_ randomise query
ports for security reasons.  Going out of your way to disable that
feature makes your nameserver highly vulnerable to cache-poisoning
attacks.  So, don't use "query-source".
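
A quick way to audit an existing conffile for that mistake is a grep
along the following lines.  This is only a sketch: the function name is
hypothetical, the pattern looks for "query-source" (or
"query-source-v6") followed by a numeric port, and it will also match
such a line if it happens to be commented out.

```shell
#!/bin/sh
# find_pinned_query_port CONFFILE   (hypothetical helper name)
# Print (with line numbers) any query-source / query-source-v6 statement
# that pins outgoing queries to a fixed numeric port.  Returns 0 if at
# least one such line is found, non-zero otherwise.
find_pinned_query_port() {
    grep -nE 'query-source(-v6)?[^;]*port[[:space:]]+[0-9]+' "$1"
}
```

For example, "find_pinned_query_port /etc/bind/named.conf" printing
nothing is the result you want.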

And do _not_ let anyone loose with root-user access and a desire to
write iptables rules, without verifying that he/she has a clue about
what the essential services actually need.  I've seen more novice
sysadmins shoot themselves in the foot doing firewalling than just
about everything else combined.

Getting back to Michael's checks, there remained point #2, zone 
transfers, which he checked indirectly, as follows:

> $ dig @198.144.195.186 -t A sf-lug.com. +short
> 208.96.15.252
> $ dig @198.144.195.186 -t A sf-lug.com. +short +tcp
> 208.96.15.252
> $

(As a reminder, IP "198.144.195.186" is the secondary = slave
nameserver for domain sf-lug.com, _my_ nameserver.)

Michael verified that the slave nameserver, likewise, is now able to
resolve "sf-lug.com", using both UDP and TCP query types.  This
_indirectly_ confirms that the slave must have successfully pulled down
a fresh copy of the zone from the master recently, because (as we know
from upthread), as of yesterday the slave had expired out the copy of
the data it had on file from a couple of weeks ago, the zone's SOA
Expire interval having run out without any zone transfers occurring to
refresh the data from the master.

Michael could have checked _directly_ for that zone transfer by looking
in the master server's /var/log/messages file, where he'd have seen the
master sending out a NOTIFY signal (the "Hey, slave nameservers, there's
a zone being freshly loaded (e.g., because the DNS daemon just started)
or revised for you to pick up" notice that masters send to slaves, as
part of the DNS protocols), and then the record of 198.144.195.186
pulling down the zone.
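
A sketch of that log check, assuming named logs via syslog to
/var/log/messages and uses the message wording I'd expect from BIND9
("sending notifies" for the NOTIFY, "transfer of" for the AXFR).  The
exact wording varies between BIND versions and logging setups, so treat
the patterns, like the function name, as assumptions to adjust.

```shell
#!/bin/sh
# zone_transfer_log ZONE [LOGFILE]   (hypothetical helper name)
# Show log lines for ZONE that mention outgoing NOTIFYs or zone
# transfers.  The two message patterns are assumptions about how this
# BIND9 build words its log entries.
zone_transfer_log() {
    zone=${1%.}                    # accept "sf-lug.com." or "sf-lug.com"
    log=${2:-/var/log/messages}
    grep -F "$zone/IN" "$log" | grep -E 'sending notifies|transfer of'
}
```

On the master, "zone_transfer_log sf-lug.com." should then show the
NOTIFY followed by 198.144.195.186 fetching the zone.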

Checking the slave nameserver's ability to answer queries is good;
checking for the zone transfer's occurrence directly is also good.

Anyway, I should add:  Just two nameservers is a bad idea.  Best
practice per the RFCs is a recommended _minimum_ of three, maximum seven.
Admittedly, sf-lug.com would not have been saved from downtime by a
second slave, given that y'all failed to notice for a couple of weeks
your master nameserver being offline, but achieving at least the
recommended level of redundancy will save you from most other types of
outages.

I can offer you SVLUG's nameserver as a second slave.  NS1.SVLUG.ORG, IP
64.62.190.98.  Just add it to ns1.sf-lug.com's allow-transfer ACL in
/etc/bind/named.conf, restart BIND9, and let me know.  I'll set up slave
nameservice and confirm that it can pull down zones and answer queries,
and you then add it to the authoritative list.

You really should not keep trying to get by with only two.  Bad idea.
Really.

Ideally, you should (and, well, I should, too) also set up a little 
cronjob, using "dig" to query the master and all slaves for the SOA
record, and then parsing out the zonefile S/N using awk and making sure
all nameservers are reporting the same value -- and sending admins
e-mail if any of the machines isn't serving up data or isn't up-to-date.
Running that daily would more than suffice.  Shouldn't be difficult.
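
Here is a minimal sketch of such a check, assuming a POSIX shell, dig,
and awk.  The function names are hypothetical.  It relies on
"dig +short" printing the SOA record's fields on one line, with the
serial as the third field.

```shell
#!/bin/sh
# soa_serial SERVER ZONE   (hypothetical helper name)
# Print the zone serial number that SERVER reports for ZONE, or nothing
# if SERVER gives no usable SOA answer.
soa_serial() {
    dig "@$1" "$2" SOA +short | awk '!/^;/ && NF >= 7 { print $3; exit }'
}

# check_serials ZONE MASTER SLAVE...   (hypothetical helper name)
# Compare every slave's serial against the master's.  Problems are
# printed on stdout, so that cron will mail them.  Returns non-zero if
# the master is silent or any slave is silent or out of date.
check_serials() {
    zone=$1 master=$2
    shift 2
    want=$(soa_serial "$master" "$zone")
    if [ -z "$want" ]; then
        echo "$master: no SOA answer for $zone"
        return 1
    fi
    rc=0
    for ns in "$@"; do
        got=$(soa_serial "$ns" "$zone")
        if [ "$got" != "$want" ]; then
            echo "$ns: serial '$got' differs from master's $want for $zone"
            rc=1
        fi
    done
    return $rc
}
```

Put in a script ending with a line such as
"check_serials sf-lug.com. 208.96.15.252 198.144.195.186" and run
daily from cron, it stays quiet while all is well; cron mails whatever
the job prints to the crontab's MAILTO address.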

Why?  Because, as one learns the hard way, the guy who promised to do
secondary nameservice for you a year ago will often forget, shut it off,
and not bother to tell you.

Actually, in fact, that aforementioned cronjob, to make it _really_ 
useful, would also query either "whois" or the NS records inside the
parent zone's records (in this case, the .com TLD zone's
nameservers[1]), to find out whether the master and slaves are still
authoritative.  Because the other mishap that keeps occurring is:  A
year ago, you agreed to do secondary for a friend's domain, and have
been doing it faithfully.  One day, it somehow occurs to you to reverify
that your secondary service is still authoritative, and suddenly it's
not:  You've been doing secondary pointlessly for an unknown number of
months, because the bozo ceased using your service and failed to mention
that fact to you.
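
The parent-zone half of that check might look like the following
sketch.  It asks one of the parent zone's nameservers (for .com,
a.gtld-servers.net is one of them) for the zone's NS records without
recursion, and reads the delegation out of dig's answer and authority
sections.  Function names are hypothetical.

```shell
#!/bin/sh
# delegated_ns ZONE PARENT_SERVER   (hypothetical helper name)
# List the nameservers that PARENT_SERVER delegates ZONE to, lowercased,
# one per line.  +norecurse makes the parent answer from its own data.
delegated_ns() {
    dig "@$2" "$1" NS +norecurse +noall +answer +authority |
        awk '$4 == "NS" { print tolower($5) }' | sort -u
}

# is_delegated ZONE PARENT_SERVER NAMESERVER   (hypothetical helper name)
# Succeed if NAMESERVER appears in the parent's delegation for ZONE.
# The comparison ignores case and a missing trailing dot.
is_delegated() {
    want=$(printf '%s' "${3%.}." | tr 'A-Z' 'a-z')
    delegated_ns "$1" "$2" | grep -qxF "$want"
}
```

So "is_delegated sf-lug.com. a.gtld-servers.net ns1.svlug.org" failing
would be the cue either to get the delegation updated or to stop doing
secondary for a domain that no longer points at you.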

Last, Jim, have you considered graduating to something better than
BIND9?  Seriously.  It's a dreadful piece of code:  slow, RAM-grabbing,
overfeatured, and with a questionable security model.  It's no longer as
scandalously buggy as BIND8 was, having gone through a total rewrite,
but it's still scandalously bad.

I realise you're, in part, trying to teach aspiring sysadmins how to
wrangle the same terrible software they're likely to encounter in
industry, but, now that you've done that for a while and learned the
ropes, maybe you're ready to switch to something that doesn't suck.

If ns1.sf-lug.com is doing only authoritative service, e.g., the
machine's /etc/resolv.conf doesn't point to it for general ("recursive")
nameservice, then look no further than NSD.  I can send you an example
setup, as it's what we use on NS1.SVLUG.ORG.