[sf-lug] fixed: Re: sf-lug.com master/primary DNS appears broken (SERVFAIL)

Rick Moen rick at linuxmafia.com
Mon Jul 24 23:20:58 PDT 2017


Quoting Michael Paoli (Michael.Paoli at cal.berkeley.edu):

> Looks like we probably have smoking gun, operator error 8-O :
> Jul 21 23:43:59 balug-sf-lug-v2 named[623]: dns_rdata_fromtext:
> /etc/bind/master/sf-lug.com:5: near 'SERIAL': syntax error
> And looking back wee bit, looks like a ; went missing between the serial
> number and the comment.
> And here we can more clearly see the booboo:
> # 2>>/dev/null rcsdiff -r1.35 sf-lug.com | head -n 5
> 5c5
> <                       1478187114      ; SERIAL ; date +%s
> ---
> >                      1500705690 SERIAL ; date +%s
> 22,23d21
> #

Aha.  I long ago observed that BIND9 _seems_ to have absolutely abysmal
linting facilities for syntax errors in the conffiles and zonefiles --
until you look deeper.  System logs should have correctly recorded the
fact of the zonefile being marked invalid and not to be served
prospectively (thus the 'SERVFAIL' rcode in response to queries about
specifically that domain) -- but you could easily miss that unless
watching /var/log/daemon.log (or equivalent) after an edit and zone
reload.

The reason I say _seems_ is that linting is available, just not where
you expect.  It's in ancillary utilities named-checkzone and
named-checkconf.

You probably think 'I guess I should use named-checkzone to lint zones
after editing', but no, you really want to use named-checkconf -- because
it _very_ nicely recurses through all your BIND9 conffile snippets
(include files and all), _and_ (given -z) lints the referenced zonefiles,
_and_ (given -t) Does The Right Thing if you have chrooted BIND.

Two jobs ago, I was at a place with an absolutely enormous stack of
hundreds of domains served by BIND9.  Having downtime of the external
nameservers because of a mysterious syntax error was a big deal.  So I
wrote a standard recipe to check all that before restarting BIND or
reloading anything, something like this:

 /usr/sbin/named-checkconf -z -t /var/named/chroot/ /etc/named.conf | \
 egrep -v '(loaded serial|all zones must be in views)'
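
A pipeline like that only helps if somebody actually reads what comes
out the far end.  Here is a minimal sketch of wrapping the same filter
so that a script can act on it.  The function names are mine, not part
of the original recipe, and the filter reads stdin so that you can try
it on a box without BIND installed:

```shell
#!/bin/sh
# Sketch only: the function names are illustrative, not from the
# original recipe.

# Drop the two known-harmless message types; pass everything else.
filter_noise() {
    # grep -E is the current spelling of egrep.  The '|| true' keeps
    # "no lines left" from looking like a failure under 'set -e'.
    grep -Ev '(loaded serial|all zones must be in views)' || true
}

# Print whatever survives the filter, and return non-zero if
# anything did, so the caller can refuse to reload.
lint_report() {
    leftover=$(filter_noise)
    if [ -n "$leftover" ]; then
        printf '%s\n' "$leftover"
        return 1
    fi
    return 0
}
```

You'd feed it with something like
'named-checkconf -z -t /var/named/chroot/ /etc/named.conf 2>&1 | lint_report'.
The '2>&1' is belt-and-braces on my part, in case your build sends
some messages to stderr.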

At the time, I researched the BIND9 source code's error trapping (used by
named-checkconf) and made a much longer egrep -v list, everything that I
felt was basically linting noise and should be ignored.  One of my
colleagues inverted the grep, though, and so we did this:

 /usr/sbin/named-checkconf -z -t /var/named/chroot/ /etc/named.conf | \
 egrep -e 'missing|not allowed|unknown|not at top of zone' \
 -e 'appears to be an address|no current owner name|MAXTTL|file not found' \
 -e 'may not be used with|outside epoch|in future|invalid|unsupported|no TTL' \
 -e 'ignoring| TTL set to prior TTL' | sort -u

At the time, I was grateful for my colleague's change, as it made the 
command shorter than mine was with its even-longer 'egrep -v' list.
Thinking back on this, though, I think I was right and my colleague
wrong, because of something he didn't think of:  New warning or error
text added with a later BIND9 release.  My version would have made those
stand out; his would suppress them.
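
To make that difference concrete, here is a small sketch that runs
both styles of filter over one message neither of them has seen
before.  The message text is made up (I'm not quoting any real BIND9
release), and the pattern lists are shortened from the ones above:

```shell
#!/bin/sh
# Two filters over the same lint output, read from stdin.
# Function names and the sample message are illustrative only.

# My style: hide known noise, show everything else.
hide_known_noise() {
    grep -Ev 'loaded serial|all zones must be in views' || true
}

# My colleague's style: show only messages already known to matter.
show_known_errors() {
    grep -E 'missing|not allowed|unknown|invalid|unsupported|no TTL|ignoring' || true
}

# Made-up diagnostic standing in for text a later release might add.
NEW_MSG='zone example.com/IN: hypothetical new diagnostic'

printf '%s\n' "$NEW_MSG" | hide_known_noise     # prints the line
printf '%s\n' "$NEW_MSG" | show_known_errors    # prints nothing
```

The first filter errs on the side of showing you too much; the second
errs on the side of silence, which is the wrong way to fail for a
pre-reload check.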

BTW, make sure you run 'rndc' with no parameters or options and read the
help text displayed:  They've added some really useful command modes for 
re-parsing the conffiles and zonefiles, including reloading the
zonefiles selectively.  If you haven't looked at the feature set in a
long while, you may be pleasantly surprised.
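
As a taste of what I mean, here's a rough sketch of two of those
modes wrapped in shell functions.  'rndc reconfig' and 'rndc reload
<zone>' are real subcommands, but do check the help text from your
own rndc, as the set varies by release.  The wrapper names and the
RNDC variable are just my illustration:

```shell
#!/bin/sh
# RNDC can be overridden (e.g. RNDC=echo) to preview the command
# lines without touching a running named.  Wrapper names are
# illustrative only.
RNDC=${RNDC:-rndc}

# Re-read named.conf and load any newly added zones, without
# reloading zones that are already being served.
reconfig_zones() {
    "$RNDC" reconfig
}

# Reload one zone by name, leaving all the others alone.
reload_zone() {
    "$RNDC" reload "$1"
}
```

So after fixing a single zonefile you'd run 'reload_zone sf-lug.com'
rather than bouncing the whole daemon.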

One of the reasons zonefile or BIND9 conffile syntax errors were so
disastrous at my old firm is that, back then, the only way to remove a
zone or add a new zone to BIND9 was to restart the BIND9 service.  If 
restart failed because of a particularly critical syntax error, your
whole nameservice remained down until you found it, fixed it, and
started again.


> I'm suspecting
> # systemctl reload bind9.service
> was rather quiet about it, whereas a more "old school" bind reload may
> have said more.

Before I discovered the linting facilities (particularly
named-checkconf), standard procedure if restarting BIND9 was to have a
terminal open with 'tail -f /var/log/daemon.log' to spot invalidated
zones.

But if you make a point of running named-checkconf before restarting
BIND, you can catch those before they happen.
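
A minimal sketch of that habit as a shell function, gating on
named-checkconf's exit status rather than on its output.  The
function name, the variables, and the default conffile path
(/etc/bind/named.conf, Debian-style) are my assumptions; adjust for
your own layout, and add -t if you run chrooted:

```shell
#!/bin/sh
# Sketch: reload only when the lint pass exits cleanly.
# CHECKCONF, RNDC and CONF are overridable; the defaults below are
# assumptions, not anything dictated by BIND9.
CHECKCONF=${CHECKCONF:-named-checkconf}
RNDC=${RNDC:-rndc}
CONF=${CONF:-/etc/bind/named.conf}

checked_reload() {
    # -z makes named-checkconf also try loading the master zones
    # named in the config, so zonefile syntax errors get caught too.
    if "$CHECKCONF" -z "$CONF"; then
        "$RNDC" reload
    else
        echo "lint failed; not reloading" >&2
        return 1
    fi
}
```

If the check fails, the old (still-valid) data stays in service while
you go find the missing semicolon.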

_And_, the newer, fine-control abilities of the 'rndc' utility are a
much smarter approach than restarting all of BIND9.  99% of the reasons
for restarting all of BIND9 have been eliminated, and if you're still
doing that, you need to read up on rndc, which will make your life
better.
