[sf-lug] fixed: Re: sf-lug.com master/primary DNS appears broken (SERVFAIL)

Michael Paoli Michael.Paoli at cal.berkeley.edu
Mon Jul 24 20:50:44 PDT 2017


And corrected.  :-)  Thanks for catching that.

And, the diagnosing, isolating, and fixing:
Since I happened to have access from the client side, I checked to
confirm:
$ hostname && dig @198.144.194.238 ns1.sf-lug.com. AXFR
linuxmafia.com

; <<>> DiG 9.4.2 <<>> @198.144.194.238 ns1.sf-lug.com. AXFR
; (1 server found)
;; global options:  printcmd
; Transfer failed.
$
Then quick look on server at logs:
# tail -n 2 /var/log/daemon.log
Jul 24 19:36:55 balug-sf-lug-v2 named[610]: client 198.144.195.186#58623 (ns1.sf-lug.com): bad zone transfer request: 'ns1.sf-lug.com/IN': non-authoritative zone (NOTAUTH)
Jul 24 19:37:35 balug-sf-lug-v2 named[610]: client 198.144.195.186#37243 (ns1.sf-lug.com): bad zone transfer request: 'ns1.sf-lug.com/IN': non-authoritative zone (NOTAUTH)
#
Odd/interesting ... what's changed lately on server?
# ls -ldc /usr/sbin/named
-rwxr-xr-x 1 root root 588888 Jul  8 09:33 /usr/sbin/named
#
... not that recent ... any other highly recent software changes?
nothing relevant since 2017-07-08 (per our local
/var/local/log/log and also confirmed with /var/log/dpkg.log)
system generally complaining about anything else? - nothing obvious
any full filesystems? no
I believe transfer is set open from any IP, does it also
fail from 127.0.0.1? - yes:
$ dig @127.0.0.1 sf-lug.com. AXFR

; <<>> DiG 9.9.5-9+deb8u12-Debian <<>> @127.0.0.1 sf-lug.com. AXFR
; (1 server found)
;; global options: +cmd
; Transfer failed.
$
Let's look back through logs to see when named first started
throwing these errors - and last prior good xfer:
Jul  8 15:44:30 balug-sf-lug-v2 named[26994]: client 198.144.195.186#48245 (sf-lug.com): transfer of 'sf-lug.com/IN': IXFR started
Jul  8 15:44:30 balug-sf-lug-v2 named[26994]: client 198.144.195.186#48245 (sf-lug.com): transfer of 'sf-lug.com/IN': IXFR ended
Jul 24 19:31:20 balug-sf-lug-v2 named[610]: client 198.144.195.186#33221 (ns1.sf-lug.com): bad zone transfer request: 'ns1.sf-lug.com/IN': non-authoritative zone (NOTAUTH)
Let's see when zone file last changed and if something is mucked up
with it:
# cd /etc/bind/master && pwd -P
/var/lib/named/etc/bind/master
# ls -ld sf-lug.com
-r--r----- 1 root bind 1103 Jul 21 23:43 sf-lug.com
Permissions look good, but mtime within our suspect range.
# fgrep -i serial sf-lug.com
                         1500705690 SERIAL ; date +%s
#
We've got "in-line" BIND 9 DNSSEC signing going on, so server may give
out slightly "higher" serial number, but it ought not be "lower" than
what's in the master zone file:
# fgrep -i serial sf-lug.com
                         1500705690 SERIAL ; date +%s
# dig +noall +norecurse +answer @127.0.0.1 sf-lug.com. SOA
#
Our lack of response on the latter would typically indicate something is
borked with the zone.
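
For what it's worth, the serial comparison above could be scripted. This is only a rough sketch: the function names file_serial, served_serial and serial_ok are made up for illustration, it assumes the zone file tags its serial line with the word SERIAL as ours does, and the plain numeric comparison ignores serial number wraparound (RFC 1982), which is harmless for date +%s serials but is an assumption:

```shell
#!/bin/sh
# Serial as written in the zone file: first field of the first line
# that mentions SERIAL (assumes the file is laid out like ours).
file_serial() {     # usage: file_serial ZONEFILE
    awk 'toupper($0) ~ /SERIAL/ { print $1; exit }' "$1"
}

# Serial the server actually hands out: third field of the short SOA answer.
served_serial() {   # usage: served_serial SERVER ZONE
    dig +short +norecurse @"$1" "$2" SOA | awk '{ print $3 }'
}

# Succeeds only if some serial is served and it is not lower than the file's.
# Plain numeric compare, no RFC 1982 wraparound handling.
serial_ok() {       # usage: serial_ok FILE_SERIAL SERVED_SERIAL
    [ -n "$2" ] && [ "$2" -ge "$1" ]
}
```

Something like serial_ok "$(file_serial sf-lug.com)" "$(served_serial 127.0.0.1 sf-lug.com.)" would then fail both when the server answers with a lower serial and when, as here, it gives no answer at all.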
# uptime
  19:56:34 up 23:58,  4 users,  load average: 0.00, 0.04, 0.05
#
Server host was rebooted not all that horribly long ago, so very possible
BIND effectively "masked" the issue (didn't load flawed zone) ...
until reboot, when it no longer had good older data in cache, nor would
it load bad zone file (working hypothesis, anyway)
if something is wrong with zone file, that ought at least have shown
in logs when reload was attempted ...
# stat -c '%y %n' sf-lug.com
2017-07-21 23:43:08.000000000 -0700 sf-lug.com
#
(with GNU ls, there's also the --full-time option, but the stat was much
more concise in what I wanted - mtime to at least the second)
... we'll look around then to see what named says in the logs about sf-lug.com
Looks like we probably have smoking gun, operator error 8-O :
Jul 21 23:43:59 balug-sf-lug-v2 named[623]: dns_rdata_fromtext: /etc/bind/master/sf-lug.com:5: near 'SERIAL': syntax error
And looking back wee bit, looks like a ; went missing between the serial
number and the comment.
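
A log search along these lines would have turned that up. It's a sketch, not the exact command I ran: named_zone_errors is a made-up name, and the list of message fragments it greps for is my guess at what named says when a zone load goes wrong, so adjust to taste:

```shell
#!/bin/sh
# Print named log lines that mention the zone and look like failures.
# The pattern list (syntax error, not loaded, failed) is an assumption
# about named's wording, not an exhaustive list.
named_zone_errors() {   # usage: named_zone_errors ZONE < logfile
    grep -F -- "$1" |
        grep -E 'named\[[0-9]+\]:.*(syntax error|not loaded|failed)'
}
```

E.g. named_zone_errors sf-lug.com < /var/log/daemon.log, run right after a reload.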
And here we can more clearly see the booboo:
# 2>>/dev/null rcsdiff -r1.35 sf-lug.com | head -n 5
5c5
<                       1478187114      ; SERIAL ; date +%s
---
>                       1500705690 SERIAL ; date +%s
22,23d21
#
I'm suspecting
# systemctl reload bind9.service
was rather quiet about it, whereas a more "old school" bind reload may
have said more.
# (cd / && umask 022 && systemctl reload bind9.service)
gives a return code of 0 and writes nothing to stderr, despite the
bad data in the zone file.
# (cd / && umask 022 && /etc/init.d/bind9 reload)
Reloading bind9 configuration (via systemctl): bind9.service.
#
Well, that's not so useful, it just uses systemctl to do the work anyway.
#
Anyway, clear enough in the logs - should've confirmed the zone loaded
okay by the served DNS data, and/or seeing successful non-erroneous load
in the logs.  My bad for apparently having not checked on that.
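
One way to avoid a repeat is to have the zone file checked before any reload. A minimal sketch, assuming BIND's named-checkzone is installed; safe_reload is a hypothetical helper, and CHECKZONE and RELOAD are variables I've introduced so the commands can be swapped out:

```shell
#!/bin/sh
# Defaults are the real commands; both can be overridden from the environment.
CHECKZONE=${CHECKZONE:-named-checkzone}
RELOAD=${RELOAD:-systemctl reload bind9.service}

# Reload only if the zone file parses; otherwise complain and return non-zero.
safe_reload() {   # usage: safe_reload ZONE ZONEFILE
    if $CHECKZONE "$1" "$2"; then
        $RELOAD
    else
        echo "not reloading: $2 failed $CHECKZONE" >&2
        return 1
    fi
}
```

E.g. safe_reload sf-lug.com /etc/bind/master/sf-lug.com. A clean parse still doesn't prove named is serving the new data, so checking the served SOA and the logs afterward remains worthwhile.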
Anyway, fixing ...
# date +%s && co -l -M sf-lug.com && ex +/SERIAL/ sf-lug.com
1500952261
RCS/sf-lug.com,v  -->  sf-lug.com
revision 1.36 (locked)
done
sf-lug.com: unmodified: line 5
:.
                         1500705690 SERIAL ; date +%s
:s/1500705690/1500952261        ;/
                         1500952261      ; SERIAL ; date +%s
:w
sf-lug.com: 28 lines, 1105 characters
:q
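
The original slip came from hand-editing the serial line, so a substitution that touches only the digits and leaves the "; SERIAL" comment alone is arguably safer. A sketch only, not how I actually did it: bump_serial is a made-up name, and it assumes the serial is the first run of digits on a line already carrying the ; SERIAL comment:

```shell
#!/bin/sh
# Replace the serial on the line carrying the "; SERIAL" comment.
# Lines without the semicolon are left untouched, so this will not
# paper over a zone file that is already broken.
bump_serial() {   # usage: bump_serial NEW_SERIAL < zonefile > zonefile.new
    sed "/;[[:space:]]*SERIAL/s/[0-9][0-9]*/$1/"
}
```

E.g. bump_serial "$(date +%s)" < sf-lug.com > sf-lug.com.new, then inspect the diff before moving the new file into place.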
# (cd / && umask 022 && systemctl reload bind9.service)
#
"Acid" test - see if the change propagated:
... no, not even working locally yet ...
let's look more closely at what changed relative to earlier
good zone file:
... hmmm, comparing by eyeball, it's just not jumping out at me,
let's see what the logs complain about where ...
Jul 24 20:26:21 balug-sf-lug-v2 named[610]: zone sf-lug.com/IN (unsigned): journal rollforward failed: journal out of sync with zone
So, looks like we had the bad zone "too long" and are now out of
sync with our automagic DNSSEC ... probably need bit of manual fix ...
# rm sf-lug.com.jnl
# (cd / && umask 022 && systemctl reload bind9.service)
$ dig @198.144.195.186 +noall +norecurse +answer sf-lug.com. SOA
sf-lug.com.             86400   IN      SOA     ns1.sf-lug.com. jim.well.com. 1500953504 10800 3600 1209600 10800
$
And all is well again - slave picked up the zone fine.
And to check in (our this time good) change:
# ci -u -M -d -m'corrected botched syntax that broke zone file' sf-lug.com
RCS/sf-lug.com,v  <--  sf-lug.com
new revision: 1.37; previous revision: 1.36
done
#

> From: "Michael Paoli" <Michael.Paoli at cal.berkeley.edu>
> Subject: Re: sf-lug.com master/primary DNS appears broken (SERVFAIL)
> Date: Mon, 24 Jul 2017 19:25:06 -0700

> 8-O
>
> Thanks for the notification, looking into it.
>
>> From: "Rick Moen" <rick at linuxmafia.com>
>> Subject: sf-lug.com master/primary DNS appears broken (SERVFAIL)
>> Date: Mon, 24 Jul 2017 18:56:12 -0700
>
>> Throughout today, logcheck has been nagging me about failures of
>> AXFR/IXFR requests from my secondary DNS for sf-lug.com to the domain
>> primary nameserver,
>>
>>
>> ----- Forwarded message from logcheck system account <logcheck at linuxmafia.com> -----
>>
>> Date: Mon, 24 Jul 2017 18:02:02 -0700
>> From: logcheck system account <logcheck at linuxmafia.com>
>> To: root at linuxmafia.com
>> Subject: linuxmafia.com 2017-07-24 18:02 System Events
>>
>> System Events
>> =-=-=-=-=-=-=
>> Jul 24 17:53:23 linuxmafia named[16805]: zone sf-lug.com/IN: refresh: unexpected rcode (SERVFAIL) from master 198.144.194.238#53 (source 0.0.0.0#0)
>>
>>
>> ----- End forwarded message -----
>>
>>
>> Checking by direct examination:
>>
>> $ dig -t soa ns1.sf-lug.com @NS1.SF-LUG.COM +short
>> $ dig -t soa ns1.sf-lug.com @NS1.SF-LUG.COM
>>
>> ; <<>> DiG 9.4.2 <<>> -t soa ns1.sf-lug.com @NS1.SF-LUG.COM
>> ;; global options:  printcmd
>> ;; Got answer:
>> ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 34122
>> ;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
>> ;; WARNING: recursion requested but not available
>>
>> ;; QUESTION SECTION:
>> ;ns1.sf-lug.com.                        IN      SOA
>>
>> ;; Query time: 339 msec
>> ;; SERVER: 198.144.194.238#53(198.144.194.238)
>> ;; WHEN: Mon Jul 24 18:45:48 2017
>> ;; MSG SIZE  rcvd: 32
>>
>> $
>>
>>
>> I conclude primary DNS service is currently broken at master (/primary)
>> nameserver NS1.SF-LUG.COM, IP 198.144.194.238.  Please be advised.



