[sf-lug] BIND DNS admin Re: fixed: Re: sf-lug.com master/primary DNS appears broken (SERVFAIL)

Wed Jul 26 02:13:03 PDT 2017

Ah, excellent points.  :-)

I got sloppy 8-O ... I think brain was tired and thinkin' something
like non-canonical, not important, who cares ... or otherwise forgot
to check, and ... oops.  Yeah, if it's active and in the configs
some person(s) and/or thing(s) care, and it ought be checked!

named-checkzone I'd used quite a bit in past ... but not so much
recently.
named-checkconf - ah, excellent one - I was unaware (or had forgotten).
And good point on rndc ... I used to use it *much* more regularly
when dealing with many hundreds of production zones, and yes,
peeking at it again, some nice newer capabilities have been added.

As for named-checkconf :-) just added...
# expand -t 4 < ~/bin/Named-checkconf
#!/bin/sh

rc=0

t=$(mktemp) || exit

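for sig in 1 2 3 15
do
     # double quotes so $sig expands now (each trap re-raises its own
     # signal); \$t and \$\$ are left to expand when the trap fires
     trap "
         rm \"\$t\"
         trap - $sig
         kill -$sig \"\$\$\"
     " "$sig"
done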

trap '
     rm "$t" || rc="$?"
     trap - 0
     exit "$rc"
' 0

named-checkconf -j -z -t /var/lib/named /etc/bind/named.conf \
> "$t" 2>&1 ||
rc="$?"
named-checkconf -t /var/lib/named /etc/bind/bind.keys >> "$t" 2>&1 ||
rc="$?"
named-checkconf -t /var/lib/named /etc/bind/rndc.key >> "$t" 2>&1 ||
rc="$?"
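# show anything other than the routine "loaded serial" lines;
# grep -v exits 1 only when nothing at all is left to show
< "$t" grep -v '^zone [^ ][^ ]*/IN: loaded serial [1-9][0-9]*$'
[ "$?" -eq 1 ] || { [ "$rc" -ne 0 ] || rc=1; }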

# trap ... 0 handles our cleanup and exit value

I might change the RE if I ever expect zone serial number of 0,
but for my data set, I'm not - at least currently - regularly expecting
that.
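
For what it's worth, if serial 0 ever did need to pass as a clean load,
the looser filter might look something like the sketch below.
filter_loaded is a name made up purely for illustration (it's not in
the script above), and the usage line just reuses the named.conf path
from that script.

```shell
#!/bin/sh
# Hypothetical helper (not part of the script above): drop the routine
# "loaded serial" lines, accepting any decimal serial, including 0.
filter_loaded() {
    grep -v '^zone [^ ][^ ]*/IN: loaded serial [0-9][0-9]*$'
}

# Usage sketch: anything that survives the filter deserves a look, e.g.
#   named-checkconf -z /etc/bind/named.conf 2>&1 | filter_loaded
```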

Also, good/excellent as named-checkconf may well be, *still* highly
prudent to check all worked fine after reload or the like.  It's
always possible, even if the syntax is fine, that something else
could *possibly* go wrong with reload or such, that may not be
caught by named-checkconf or may be entirely independent of
syntax (e.g. some other host or resource issue or whatever,
or variation in behavior between named and named-checkconf
interpretation, or state differences that cause issue with
named but not named-checkconf (e.g. what were and are serial
numbers in the zone files - named-checkconf only knows what they *are*)).
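
A rough sketch of one such after-the-reload check: compare the serial
the server is actually handing out against the serial that was supposed
to get loaded.  soa_serial and serial_matches are made-up names, and
querying 127.0.0.1 is an assumption about where named is listening; all
it relies on is that the third field of dig +short SOA output is the
serial.

```shell
#!/bin/sh
# Hypothetical helpers; the names are made up for illustration.

# Pull the serial (3rd field) out of one line of "dig +short ... SOA":
#   mname rname serial refresh retry expire minimum
soa_serial() {
    awk 'NR == 1 { print $3 }'
}

# $1 = expected serial, $2 = SOA line as printed by dig +short.
# Succeeds only if the served serial equals the expected one; an empty
# answer (e.g. SERVFAIL gives no +short output) counts as a mismatch.
serial_matches() {
    [ "$(printf '%s\n' "$2" | soa_serial)" = "$1" ]
}

# Usage sketch (assumes named answers on 127.0.0.1):
#   soa=$(dig +short @127.0.0.1 sf-lug.com. SOA)
#   serial_matches 1500705690 "$soa" ||
#       echo 'served serial is not what was loaded' >&2
```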

In production past, had full staging DNS nameserver - any and all
changes were first made and validated there.  All the configurations
were under version control.  And when deployed to production,
results were also validated there for all nameservers.
We even had a "cooperative locking mechanism" to prevent multiple
DNS admins from stepping on each other's toes by potentially
simultaneously making possibly conflicting changes unbeknownst
to each other (involved a LOCK file in both version control
and in directory where the zone files were - at least if I'm
remembering the location correctly ... that lock file could also
be non-empty plain text indicating, e.g. who was working on the
change(s), general scope of changes, and when they estimated being
done with the changes).  Anyway, worked pretty dang well (I'd
built and rebuilt much of that infrastructure, and also documentation
and procedures).
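
Going from memory, so strictly a sketch and not what we actually ran:
the on-disk half of that kind of cooperative lock can be done with the
shell's noclobber option, which makes creating the LOCK file an
all-or-nothing operation.  take_lock and drop_lock are made-up names,
and the version control half is left out entirely.

```shell
#!/bin/sh
# Hypothetical sketch; function names and LOCK file layout are
# assumptions, not the mechanism actually used back then.

# $1 = directory holding the zone files, $2 = free-text note
# (who, scope of changes, estimated finish).
# "set -C" (noclobber) makes the > redirection fail if LOCK exists.
take_lock() {
    ( set -C; printf '%s\n' "$2" > "$1/LOCK" ) 2>/dev/null
}

# Release the lock; fails if there was no lock to release.
drop_lock() {
    rm "$1/LOCK"
}

# Usage sketch:
#   take_lock /etc/bind/master 'someadmin: serial bump, done by 03:00' ||
#       cat /etc/bind/master/LOCK    # see who already holds it
```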

And yes, bind ... perhaps rightly so, is relatively unforgiving of
syntax errors.  Looks like current versions are slightly more
forgiving.  And yes, booboo I'd occasionally see of the more junior
DNS admins would be something like:
edit zone file or the like, reload, changes not seen in served data,
try a restart to see if that fixes it, bind refuses to start due to
syntax error(s).  Ah, ... at least I think all the times I saw that
booboo, it was <= smallish departmental/group server, not larger
scale production DNS (gotta have 'em cut their teeth somewhere).
On relatively rare occasion, seen folks make fairly significant
errors with fairly large scale production DNS.  Some errors I
remember seeing: wrong data set (e.g. loading external DNS to
internal), serial number booboos, data booboos (sometimes the
specified changes to the data are perfectly valid syntax, but
wrong data), syntax booboos, errors - notably ignorance or incorrect
presumptions - regarding TTLs and caching and what will/won't be
fully visible how fast to all applicable locations, ... probably
some others too, but that's most of what jumps to mind, and probably
most or all of the more common ones.  Sometimes for different DNS
infrastructures there would or could be other issues/problems too,
e.g. like a screw-up to the database that's used for whatever
DNS server software.
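
To head off that edit/reload/restart/oops sequence, something along
these lines would do: lint the zone first, and only ask named to reload
it if the lint passes.  check_then_reload, CHECKZONE, and RNDC are all
made-up names (the two variables exist only so the logic can be tried
without a live nameserver); named-checkzone and rndc reload are the real
tools underneath.

```shell
#!/bin/sh
# Hypothetical wrapper; CHECKZONE and RNDC are made-up override knobs
# defaulting to the real named-checkzone and rndc commands.

# $1 = zone name, $2 = zone file.  Reload the zone only if it lints clean.
check_then_reload() {
    if "${CHECKZONE:-named-checkzone}" "$1" "$2"
    then
        "${RNDC:-rndc}" reload "$1"
    else
        echo "$1: not reloaded, $2 failed the check" >&2
        return 1
    fi
}

# Usage sketch:
#   check_then_reload sf-lug.com /etc/bind/master/sf-lug.com
```

Per the earlier point, a clean lint and a quiet reload still aren't
proof the served data is right, so the after-reload look is still
worth doing.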

> From: "Rick Moen" <rick at linuxmafia.com>
> Subject: Re: [sf-lug] fixed: Re: sf-lug.com master/primary DNS appears broken (SERVFAIL)
> Date: Mon, 24 Jul 2017 23:20:58 -0700

> Quoting Michael Paoli (Michael.Paoli at cal.berkeley.edu):
>
>> Looks like we probably have smoking gun, operator error 8-O :
>> Jul 21 23:43:59 balug-sf-lug-v2 named[623]: dns_rdata_fromtext:
>> /etc/bind/master/sf-lug.com:5: near 'SERIAL': syntax error
>> And looking back wee bit, looks like a ; went missing between the serial
>> number and the comment.
>> And here we can more clearly see the booboo:
>> # 2>>/dev/null rcsdiff -r1.35 sf-lug.com | head -n 5
>> 5c5
>> <                       1478187114      ; SERIAL ; date +%s
>> ---
>> >                      1500705690 SERIAL ; date +%s
>> 22,23d21
>> #
>
> Aha.  I long ago observed, BIND9 _seems_ to have absolutely abysmal
> linting facilities for syntax errors in the conffiles and zonefiles --
> until you look deeper.  System logs should have correctly recorded the
> fact of the zonefile being marked invalid and not to be served
> prospectively (thus the 'SERVFAIL' rcode in response to queries about
> specifically that domain) -- but you could easily miss that unless
> watching /var/log/daemon.log (or equivalent) after an edit and zone
> reload.
>
> The reason I say _seems_ is that linting is available, just not where
> you expect.  It's in ancillary utilities named-checkzone and
> named-checkconf.
>
> You probably think 'I guess I should use named-checkzone to lint zones
> after editing', but no, you really want to use named-checkconf -- because
> it _very_ nicely recurses through all your BIND9 conffile snippets
> (include files and all), _and_ lints the referenced zonefiles _and_ Does
> The Right Thing if you have chrooted BIND.
>
> Two jobs ago, I was at a place with an absolutely enormous stack of
> hundreds of domains served by BIND9.  Having downtime of the external
> nameservers because of a mysterious syntax error was a big deal.  So I
> wrote a standard recipe to check all that before restarting BIND or
> reloading anything, something like this:
>
>  /usr/sbin/named-checkconf -z -t /var/named/chroot/ /etc/named.conf | \
>  egrep -v '(loaded serial|all zones must be in views)'
>
> At the time, I researched the BIND9 source code's error trapping (used by
> named-checkconf) and made a much longer egrep -v list, everything that I
> felt was basically linting noise and should be ignored.  One of my
> colleagues inverted the grep, though, and so we did this:
>
> /usr/sbin/named-checkconf -z -t /var/named/chroot/ /etc/named.conf | \
> egrep 'missing|not allowed|unknown|not at top of zone
> appears to be an address|no current owner name|MAXTTL|file not found
> may not be used with|outside epoch|in future|invalid|unsupported|no TTL
> ignoring| TTL set to prior TTL' | sort -u
>
> At the time, I was grateful for my colleague's change, as it made the
> command shorter than mine was with its even-longer 'egrep -v' list.
> Thinking back on this, though, I think I was right and my colleague
> wrong, because of something he didn't think of:  New warning or error
> text added with a later BIND9 release.  My version would have made those
> stand out; his would suppress them.
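
That difference is easy enough to demonstrate with canned input.  A
small sketch: the sample lines below are invented (in particular the
"brand-new diagnostic" one, standing in for message text some later
BIND release might add), the function names are made up, and the
patterns are cut-down versions of the two quoted above.

```shell
#!/bin/sh
# Hypothetical demo; sample lines are invented, not real BIND output.
sample_output() {
    printf '%s\n' \
        'zone example.com/IN: loaded serial 1500705690' \
        'zone example.org/IN: some brand-new diagnostic' \
        'zone example.net/IN: file not found'
}

# Exclude-list: hide known noise, show everything else,
# so unfamiliar text stands out.
exclude_noise() {
    grep -E -v 'loaded serial|all zones must be in views'
}

# Include-list: show only known-bad phrases,
# so unfamiliar text is silently dropped.
include_known() {
    grep -E 'missing|unknown|file not found|invalid'
}

# Usage sketch:
#   sample_output | exclude_noise    # the example.org line is shown
#   sample_output | include_known    # the example.org line is lost
```
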
>
> BTW, make sure you run 'rndc' with no parameters or options and read the
> help text displayed:  They've added some really useful command modes for
> re-parsing the conffiles and zonefiles, including reloading the
> zonefiles selectively.  If you haven't looked at the feature set in a
> long while, you may be pleasantly surprised.
>
> One of the reasons zonefile or BIND9 conffile syntax errors were so
> disastrous at my old firm is that, back then, the only way to remove a
> zone or add a new zone to BIND9 was to restart the BIND9 service.  If
> restart failed because of a particularly critical syntax error, your
> whole nameservice remained down until you found it, fixed it, and
> started again.
>
>
>> I'm suspecting
>> # systemctl reload bind9.service
>> was rather quiet about it, whereas a more "old school" bind reload may
>> have said more.
>
> Before I discovered the linting facilities (particularly
> named-checkconf), standard procedure if restarting BIND9 was to have a
> terminal open with 'tail -f /var/log/daemon.log' to spot invalidated
> zones.
>
> But if you make a point of running named-checkconf before restarting
> BIND, you can catch those before they happen.
>
> _And_, the newer, fine-control abilities of the 'rndc' utility are a
> much smarter approach than restarting all of BIND9.  99% of the reasons
> for restarting all of BIND9 have been eliminated, and if you're still
> doing that, you need to read up on rndc, which will make your life
> better.