[sf-lug] SF-LUG.COM. DNS & what happened in 2009-02

Tue Nov 17 00:11:59 PST 2009

For those who might not have seen it earlier, or might want a
refresher, here's information on a fairly similar problem
from earlier this year.  The scenario was a bit different then,
but has quite a bit in common with the current situation.

For the details from earlier, have a read/skim through:
http://linuxmafia.com/pipermail/sf-lug/2009q1/006424.html
http://linuxmafia.com/pipermail/sf-lug/2009q1/006426.html
http://linuxmafia.com/pipermail/sf-lug/2009q1/006429.html
http://linuxmafia.com/pipermail/sf-lug/2009q1/006430.html
http://linuxmafia.com/pipermail/sf-lug/2009q1/006431.html
http://linuxmafia.com/pipermail/sf-lug/2009q1/006432.html
http://linuxmafia.com/pipermail/sf-lug/2009q1/006437.html
http://linuxmafia.com/pipermail/sf-lug/2009q1/006446.html
And for the stuff referencing either of these URLs:
http://208.96.15.252/log.txt
http://www.sf-lug.com/log.txt
just have a look at that earlier bit of data, reproduced below.

2009-02-27
Sometime today, folks started noticing problems with DNS for sf-lug.com., e.g.:
http://linuxmafia.com/pipermail/sf-lug/2009q1/006424.html
et seq.

2009-02-28 mpaoli
I noticed the DNS problems with sf-lug.com., and also found
http://linuxmafia.com/pipermail/sf-lug/2009q1/006424.html
et seq.
# fuser -n tcp 53
here: 53
53/tcp:               3319
# ps lwwwwwwwwwp 3319
F   UID   PID  PPID PRI  NI   VSZ  RSS WCHAN  STAT TTY        TIME COMMAND
1 10741  3319     1  19   0 47800 2712 rt_sig Ssl  ?          0:00 /usr/local/sbin/named-balug -u balugdns -c /etc/named-balug.conf -t /var/named/chroot-balug
... but BALUG DNS usually only listens on a different IP (208.96.15.254) -
for sf-lug.com. we're interested in: 208.96.15.252
# netstat -an | grep ':53 .*LISTEN'
tcp        0      0 208.96.15.254:53            0.0.0.0:*               LISTEN
... so, nothing listening (sf-lug.com. DNS down) on 208.96.15.252 port 53
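The "is anything listening on that IP and port?" check above can be wrapped
in a small helper.  This is only a sketch: the function name listening_on is
made up for illustration, and it assumes Linux "netstat -an" style output
(local address in the fourth column) fed in on standard input.

```shell
# Hypothetical helper (not from the original log): succeed if the
# "netstat -an" output on stdin shows a TCP listener on address $1, port $2.
listening_on() {
    awk -v want="$1:$2" '
        # Field 4 is the local address:port, last field is the socket state.
        $1 ~ /^tcp/ && $4 == want && $NF == "LISTEN" { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

Possible usage:
netstat -an | listening_on 208.96.15.252 53 || echo 'nothing listening'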
# uptime
  04:17:54 up 14 days, 10:36,  9 users,  load average: 0.00, 0.00, 0.00
Gee, I wonder if bind didn't restart and if the zone had an expire of 14 days?
That would explain a lot.
confirm *nix flavor
# cat /etc/redhat-release
CentOS release 4.4 (Final)
# ls /etc/init.d/*named*
/etc/init.d/named  /etc/init.d/named-balug
# chkconfig --list | fgrep named | fgrep -v balug
named           0:off   1:off   2:off   3:off   4:off   5:off   6:off
... that explains a lot ...
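To make the "off in every runlevel" reading less eyeball-dependent, something
like the following could summarize a service's line.  A sketch only: the
function name enabled_runlevels is hypothetical, and it assumes the
"chkconfig --list" column layout shown above, supplied on standard input.

```shell
# Hypothetical helper (not from the original log): print the runlevels in
# which service $1 is "on", given "chkconfig --list" output on stdin.
# Prints an empty line if the service is off everywhere.
enabled_runlevels() {
    awk -v svc="$1" '$1 == svc {
        out = ""
        for (i = 2; i <= NF; i++)
            if ($i ~ /:on$/) {      # e.g. "3:on"
                split($i, p, ":")
                out = out p[1]
            }
        print out
    }'
}
```

Possible usage:
chkconfig --list | enabled_runlevels named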
# ls -ld /etc/*named*conf*
lrwxrwxrwx  1 root root 44 May 12  2007 /etc/named-balug.conf -> /var/named/chroot-balug/etc/named-balug.conf
lrwxrwxrwx  1 root root 32 Mar  5  2007 /etc/named.conf -> /var/named/chroot/etc/named.conf
# rpm -qa | fgrep -i bind
bind-libs-9.2.4-24.EL4
bind-9.2.4-24.EL4
bind-chroot-9.2.4-24.EL4
bind-utils-9.2.4-24.EL4
ypbind-1.17.2-8
# ls -ld /var/named/chroot/etc/named.conf
-rw-r--r--  1 root named 1853 May 14  2007 /var/named/chroot/etc/named.conf
# ls -ld /var/named/chroot/var/named/sf-lug.com
-rw-r--r--  1 root root 440 Oct 29  2007 /var/named/chroot/var/named/sf-lug.com
# ls -ldu /var/named/chroot/var/named/sf-lug.com
-rw-r--r--  1 root root 440 Nov  8  2007 /var/named/chroot/var/named/sf-lug.com
# (cd /var/named/chroot/var/named && df -k .)
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md0               9612516   3450096   5674128  38% /
# mount | fgrep md0
/dev/md0 on / type ext3 (rw)
#
Checked that the filesystem containing the sf-lug.com zone file that's
presumably the one we're interested in isn't mounted ro or noatime (otherwise
the atime of the file wouldn't be too useful/informative in this case).
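That mount-option check could be scripted roughly as follows.  A sketch only:
the function name atime_reliable is hypothetical, it assumes one line of
Linux "mount" output on standard input with the options in trailing
parentheses, and it looks only for the two options named above (ro, noatime).

```shell
# Hypothetical helper (not from the original log): succeed if the mount
# line on stdin lists neither "ro" nor "noatime" among its options.
atime_reliable() {
    awk '{
        opts = $NF                    # e.g. "(rw)" or "(rw,noatime)"
        gsub(/[()]/, "", opts)
        n = split(opts, o, ",")
        for (i = 1; i <= n; i++)
            if (o[i] == "ro" || o[i] == "noatime") bad = 1
    }
    END { exit (bad ? 1 : 0) }'
}
```

Possible usage:
mount | fgrep md0 | atime_reliable && echo 'atime should be meaningful'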
... looks like bind hasn't (re)read that file in quite a while.  I'm
probably looking at correct file - but haven't confirmed the init config
bits to see if it's using that chroot location ... though it likely is.
Let's see if relevant restart fixes it and confirms all those bits,
but first ...
# head /var/named/chroot/var/named/sf-lug.com
$TTL 86400
$ORIGIN sf-lug.COM.
@       IN      SOA     ns1.sf-lug.com. jim.well.com. (
                         2007102904      ;Serial
                         3600            ;refresh period
                         3600            ;retry period
                         1209600         ;expire period
                         10800)          ;minimum TTL period
;
         IN      NS      ns1.sf-lug.com.
# echo '1209600/3600/24' | bc -l
14.00000000000000000000
#
Yup, ... 14 day expiration, as I suspected.
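The same arithmetic can be done straight from a one-line SOA record, without
opening the zone file.  A sketch only: the function name soa_expire_days is
made up, and it assumes the seven-field "dig +short" SOA layout
(mname rname serial refresh retry expire minimum), with expire in seconds.

```shell
# Hypothetical helper (not from the original log): print the SOA expire
# value in whole days, given one "dig +short -t SOA" style line as $1.
soa_expire_days() {
    # Field 6 is the expire interval in seconds; 86400 seconds per day.
    printf '%s\n' "$1" | awk '{ printf "%d\n", $6 / 86400 }'
}
```

Possible usage:
soa_expire_days "$(dig @208.96.15.252 -t SOA sf-lug.com. +short)"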
# (umask 022 && chkconfig named on)
... umask 022 - I don't trust Red Hat (and thus CentOS) quite enough for it to
always do the right thing ... so ... 022 for something that may install/modify,
and where I don't want the permissions to end up too tight where such isn't
desired.
# chkconfig --list | fgrep named | fgrep -v balug
named           0:off   1:off   2:on    3:on    4:on    5:on    6:off
# (cd / && umask 022 && service named start)
Starting named:                                            [  OK  ]
# netstat -an | grep ':53 .*LISTEN'
tcp        0      0 208.96.15.252:53            0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:53                0.0.0.0:*               LISTEN
tcp        0      0 208.96.15.254:53            0.0.0.0:*               LISTEN
# ls -ldu /var/named/chroot/var/named/sf-lug.com
-rw-r--r--  1 root root 440 Feb 28 04:30 /var/named/chroot/var/named/sf-lug.com
# date
Sat Feb 28 04:30:23 PST 2009
That looks much better ... and nice fresh access time (from (re)start and
hence (re)read ... and more recent than when I otherwise read the file - so
likely I was looking at the correct zone file).  And the acid test ... does it
work?  From elsewhere on the Internet:
$ dig @208.96.15.252 -t A sf-lug.com. +short
208.96.15.252
$ dig @208.96.15.252 -t A sf-lug.com. +short +tcp
208.96.15.252
$
Looks good!
... from earlier peek at SOA, we have 3600 for retry ... so, at worst case,
slave should be all better within an hour.
... and already, slave looks good:
$ dig @198.144.195.186 -t A sf-lug.com. +short
208.96.15.252
$ dig @198.144.195.186 -t A sf-lug.com. +short +tcp
208.96.15.252
$
... likely from BIND >=8 "notify"
... peeking again at the named.conf file, and the zone file, we see the slave
listed as an NS for the zone, and we find nothing in the named.conf that
would prevent BIND from sending notify to the slave, so the master likely
did so, and thus the slave would have recovered much more quickly.
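
One further way to confirm the slave really has caught up (not something
done in the log above, just a sketch) is to compare SOA serials from master
and slave.  The function name serials_match is hypothetical; it assumes each
argument is one "dig +short" SOA line, where the serial is the third field.

```shell
# Hypothetical helper (not from the original log): succeed if two
# "dig +short -t SOA" style lines carry the same, non-empty serial.
serials_match() {
    m=$(printf '%s\n' "$1" | awk '{ print $3 }')
    s=$(printf '%s\n' "$2" | awk '{ print $3 }')
    # An empty serial (e.g. no answer from a server) counts as a mismatch.
    [ -n "$m" ] && [ "$m" = "$s" ]
}
```

Possible usage:
master=$(dig @208.96.15.252 -t SOA sf-lug.com. +short)
slave=$(dig @198.144.195.186 -t SOA sf-lug.com. +short)
serials_match "$master" "$slave" && echo 'slave in sync'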