[sf-lug] DNS: sf-lug.com., general, and balug.org.

Michael Paoli Michael.Paoli at cal.berkeley.edu
Sun Mar 1 15:12:23 PST 2009



Rick, thanks for your excellent points, commentary, analysis, etc.

Some random points I'm inclined to add - for additional information,
clarification, and exceptions.  (Hey - us nerds love our exceptions and
edge cases :-> ... after all, that's where much of the interesting
stuff happens anyway: the merely pretty good stuff breaks, while the
really good stuff keeps working - or at least fails gracefully, or in a
"principle of least surprise" manner, if it must fail.)

> Date: Sat, 28 Feb 2009 14:05:09 -0800
> From: Rick Moen <rick at linuxmafia.com>
> Subject: Re: [sf-lug] Fixed: Re: DNS: sf-lug.com. "down": NS
>       208.96.15.252   "broken"
> To: sf-lug at linuxmafia.com
>
> Quoting Michael Paoli (Michael.Paoli at cal.berkeley.edu):
>
>> Fixed, details towards the tail end of:
>> http://www.sf-lug.com/log.txt
>
> Getting back to Michael's checks, there remained point #2, zone
> transfers, which he checked indirectly, as follows:
>
>> $ dig @198.144.195.186 -t A sf-lug.com. +short
>> 208.96.15.252
>> $ dig @198.144.195.186 -t A sf-lug.com. +short +tcp
>> 208.96.15.252
>> $
>
> (As a reminder, IP "198.144.195.186" is the secondary = slave
> nameserver for domain sf-lug.com, _my_ nameserver.)

If I'm not mistaken, since BIND 8 the current terminology is
master/slave ... primary/secondary was used for older versions of BIND.
ISC's use of master/slave may not have been the most PC (Politically
Correct) choice (but that's a whole other topic), but it probably made
for a most logical choice: short, descriptive, relatively unambiguous,
and commonly used and understood - including common use in other
technical constructs.  And perhaps unlike primary/secondary,
master/slave also works more logically for chained relationships - a
DNS server, even for the same zone, can be both a master (to downstream
slave(s)) and a slave (to upstream master(s)).  Primary/secondary
doesn't fit quite as smoothly in a chaining construct - particularly if
one wants to avoid the messiness of adding tertiary, etc. into
descriptions, when in each individual comparative case one generally
only needs to talk about the relationship between a pair of servers.
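
To make the chaining concrete, here's a minimal hypothetical BIND 9
named.conf sketch (made-up zone name and IPs, purely for illustration)
of a middle server that's slave to an upstream master yet also master
to a further-downstream slave, for the very same zone:

// slave to the upstream master at 192.0.2.1 ...
zone "example.com." {
        type slave;
        masters { 192.0.2.1; };
        file "slaves/example.com";
        // ... while also master to the downstream slave at
        // 192.0.2.53: allow it to transfer the zone, and notify it
        allow-transfer { 192.0.2.53; };
        also-notify { 192.0.2.53; };
};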

Why I jumped to checking queries against the slave - mostly a judgement
call and guesstimate on the efficient next step in troubleshooting (or
in checking/confirming that all was well again).  Based on the evidence
to that point, I was guesstimating that there had been effectively a
single problem on sf-lug.com. (initialization wasn't configured to
(re)start BIND for the sf-lug.com. master upon system (re)boot), that
I'd corrected it (fixed that configuration bug and (re)started BIND),
and that things should be working from there - so I mostly just jumped
to checking results - the far end of the chain - are master and slave
working properly from the Internet?  In the context of this particular
problem (e.g. the evidence was that the slave had expired the zone and
had no other master to pull the zone from), if all looked good at the
results end of things, I was going to presume we were essentially done.
Had that not been the case (e.g. the slave still not answering DNS for
sf-lug.com. even after waiting out the master zone's retry of 3600 (1
hour)), I would have dug further ... and looking at evidence of zone
transfers, notifies, and/or attempts thereof would have been a
relatively logical place to jump in (basic divide-and-conquer
troubleshooting).
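
For example - a sketch, and note that the log file name and location
vary by distribution and syslog configuration - evidence of notifies
and transfers is typically right there in BIND's logging on the master,
e.g.:

$ grep -i 'sf-lug.com' /var/log/syslog | egrep -i 'notif|transfer'

On a healthy master, one would expect to see lines about the zone's
notifies being sent, followed by the slave transferring the zone.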

> Michael verified that the slave nameserver, likewise, is now able to
> resolve "sf-lug.com", using both UDP and TCP query types.  This
> _indirectly_ confirms that the slave must have successfully pulled down
> a fresh copy of the zone from the master recently, because (as we know
> from upthread), as of yesterday the slave had expired out the copy of
> the data it had on file from a couple of weeks ago, its Time to Live
> (TTL) having expired without any zone transfers occurring to refresh the
> data from the master.
>
> Michael could have checked _directly_ for that zone transfer by looking
> in the master server's /var/log/messages file, where he'd have seen the
> master sending out a NOTIFY signal (the "Hey, slave nameservers, there's
> a zone being freshly loaded (e.g., because the DNS daemon just started)
> or revised for you to pick up" notice that masters send to slaves, as
> part of the DNS protocols), and then the record of 198.144.195.186
> pulling down the zone.
>
> Checking the slave nameserver's ability to answer queries is good;
> checking for the zone transfer's occurrence directly is also good.

Yup, ... all good :-)
The key thing is to at least be sure to include checking end results -
yes, we think we fixed it - that's good ... but ... does it actually
*work*?  In this case I jumped a bit ahead to look at the end pieces
first (and made some reasonable presumptions on the rest, based upon
those results and the scenario and evidence at hand).  In a rather
different scenario (e.g. setting up new DNS across and through multiple
responsible entities) I'd likely be more inclined to start from the
head (the primary* master), and test flow along the way, to make sure
each piece that should be working is ... all the way through to the
desired and expected end results (clients being able to resolve what
they should and from where they should, and updates working all the way
through the chain).
* In this case, by "primary master" I mean the server which is itself
master for the applicable zone(s), and not slave to some other server
for that(/those) zone(s).
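
A quick way to spot-check flow along such a chain is to compare the
zone's SOA serial number - the third field of the SOA record - as
reported at each hop; here, master then slave:

$ dig @208.96.15.252 -t SOA sf-lug.com. +short
$ dig @198.144.195.186 -t SOA sf-lug.com. +short

If every server along the chain reports the same serial, updates are
making it all the way through.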

> Anyway, I should add:  Just two nameservers is a bad idea.  Best
> practices per the RFCs is _minimum_ three recommended, maximum seven.

Yes, three or more (up to appropriate reasonable limit) is better.
Preferably, they should be geographically separated, and reasonably
separated (avoiding single points of failure) network-wise also ... at
least to the extent reasonably feasible.

Maximum ... well, that depends; in certain cases it's as high as, but
no higher than, 13.  But too many is quite bad.  The "certain cases" of
13 involve a very short domain name ... like the root servers or com.,
or a rather-to-quite-short TLD (museum. may be too long for 13).  That
optimal maximum is the most that are guaranteed to fit within a single
UDP reply packet without truncation.  For really long domain names,
seven may be too many.  So, ... what's so bad about too many?  The
complete DNS response isn't guaranteed to all fit within a single UDP
packet.  In a case where it does all fit in a UDP packet, the client
typically sends a UDP packet for the query, and receives a UDP packet
with the complete answer and without any truncation - pretty efficient.
When it isn't guaranteed to fit, things get messy - the DNS server puts
in what it can (what it deems most important), sets a flag indicating
the response is truncated, and sends that UDP packet.  The client, upon
seeing the truncation flag, then reissues the query, using TCP ... now,
with TCP, there's a 3-way handshake just to set up the connection, ...
then the query, then packet(s) containing the complete response, ...
and then the tearing down of the TCP connection ... all *much* more
overhead for client, server, and also the network bits between.

To illustrate a bit more, here's an example I wrote up earlier this
month to help explain to some coworkers (with some very slight
<snip>page, and a bit of reformatting to fold some longer lines) ... I
also add a wee bit more example/comment:

< From: Michael Paoli
< Sent: Thursday, February 12, 2009 7:09 PM
< To: <snip>
< Cc: <snip>
< Subject: Informational: RE: DNS: UDP & TCP (why it needs both, when it
< generally uses which, etc.)
<
< FYI, in case y'all are curious, ... here's a case where DNS necessarily
< uses TCP in dealing with a query/response.
<
< Oh, ... and let's not forget also, those extra trips over <snip> ...for
< TCP, it's also a 3-way handshake just to get started - so that's several
< packets bouncing back and forth just to establish the connection -
< before that connection can actually be used to do useful work.  DNS
< mostly uses UDP where it can for short queries/replies ... but it can't
< always use UDP.
<
< Anyway, in (DNS) resolving www.forallthewaysyoucare.com.
< (http://forallthewaysyoucare.com/ redirects to
< http://www.forallthewaysyoucare.com/) ... by default things start with
< UDP queries, and responses, ... but if the response doesn't fit within
< one single UDP packet, the responding server just sends one UDP packet
< to the requesting client - but with a bit flag set indicating the result
< was truncated (and it puts what it considers the most important of the
< partial information in that UDP packet).  Under most normal
< circumstances, the DNS resolver client then reissues the query - but
< using TCP this time - so that it can then get the complete response (TCP
< is connection-oriented stream protocol, so arbitrarily large data can be
< sent back - this is also why DNS uses TCP for zone transfers - they
< almost always involve much larger set of data than typical DNS
< query/reply).  So, ... in resolving www.forallthewaysyoucare.com.,
< jumping fairly far into the process, client (or intermediary server)
< asks one of these IPs (they are the IPs of the authoritative nameservers
< for the domain forallthewaysyoucare.com.):
< 216.21.231.10 dns010.a.register.com.
< 216.21.232.10 dns010.b.register.com.
< 216.21.235.10 dns010.c.register.com.
< 216.21.236.10 dns010.d.register.com.
< this question: www.forallthewaysyoucare.com.  IN      A
< i.e. what is(/are) the A record(s) for www.forallthewaysyoucare.com.
< The responding server provides the answer (if it has it or is willing
< and able to give it), flag bit to indicate if it's authoritative or not
< ... and it also includes other data, e.g.  "authority" data and
< "additional" data - mostly it's just trying to be helpful here, handing
< out information that's also likely to be needed (like, oh, you asked for
< an A record, well, there's just a CNAME for that ... but by the way, I
< also happen to have the authority information for the domain that CNAME
< is in and the A records for those NS servers) ... so there may be slight
< to significant amount of additional data provided by the server - in
< this case it's a fair amount - and more than fits in a UDP reply.  If we
< ask, using dig, it informs us that the (UDP) result was
< malformed/truncated - and it then (by default) tries again with TCP.  If
< we tweak the options on dig a bit, we can tell it to only use UDP, and
< not retry, ... or just start with and use TCP.
<
< Here we have default use of dig - it gets us all the information - after
< it retries with TCP, telling us it had issues with UDP:
< $ 2>&1 dig @216.21.232.10 -t A www.forallthewaysyoucare.com.
< ;; Warning: Message parser reports malformed message packet.
< ;; Truncated, retrying in TCP mode.
<
< ; <<>> DiG 9.2.4 <<>> @216.21.232.10 -t A www.forallthewaysyoucare.com.
< ;; global options:  printcmd
< ;; Got answer:
< ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21789
< ;; flags: qr rd; QUERY: 1, ANSWER: 1, AUTHORITY: 13, ADDITIONAL: 13
<
< ;; QUESTION SECTION:
< ;www.forallthewaysyoucare.com.  IN      A
<
< ;; ANSWER SECTION:
< www.forallthewaysyoucare.com. 14400 IN  CNAME
< forallthewaysyoucare.com.awarenessnetworks.com.edgesuite.net.
<
< ;; AUTHORITY SECTION:
< .                       518400  IN      NS      a.root-servers.net.
< .                       518400  IN      NS      b.root-servers.net.
< .                       518400  IN      NS      c.root-servers.net.
< .                       518400  IN      NS      d.root-servers.net.
< .                       518400  IN      NS      e.root-servers.net.
< .                       518400  IN      NS      f.root-servers.net.
< .                       518400  IN      NS      g.root-servers.net.
< .                       518400  IN      NS      h.root-servers.net.
< .                       518400  IN      NS      i.root-servers.net.
< .                       518400  IN      NS      j.root-servers.net.
< .                       518400  IN      NS      k.root-servers.net.
< .                       518400  IN      NS      l.root-servers.net.
< .                       518400  IN      NS      m.root-servers.net.
<
< ;; ADDITIONAL SECTION:
< a.root-servers.net.     3600000 IN      A       198.41.0.4
< b.root-servers.net.     3600000 IN      A       192.228.79.201
< c.root-servers.net.     3600000 IN      A       192.33.4.12
< d.root-servers.net.     3600000 IN      A       128.8.10.90
< e.root-servers.net.     3600000 IN      A       192.203.230.10
< f.root-servers.net.     3600000 IN      A       192.5.5.241
< g.root-servers.net.     3600000 IN      A       192.112.36.4
< h.root-servers.net.     3600000 IN      A       128.63.2.53
< i.root-servers.net.     3600000 IN      A       192.36.148.17
< j.root-servers.net.     3600000 IN      A       192.58.128.30
< k.root-servers.net.     3600000 IN      A       193.0.14.129
< l.root-servers.net.     3600000 IN      A       198.32.64.12
< m.root-servers.net.     3600000 IN      A       202.12.27.33
<
< ;; Query time: 220 msec
< ;; SERVER: 216.21.232.10#53(216.21.232.10)
< ;; WHEN: Thu Feb 12 18:45:38 2009
< ;; MSG SIZE  rcvd: 536
<
< $
<
< Here's the same, but only using UDP - note that it dropped 2 of 13 of
< the "additional" records (all of which can be seen above):
< $ 2>&1 dig @216.21.232.10 -t A www.forallthewaysyoucare.com. +ignore
< ;; Warning: Message parser reports malformed message packet.
<
< ; <<>> DiG 9.2.4 <<>> @216.21.232.10 -t A www.forallthewaysyoucare.com.
< +ignore
< ;; global options:  printcmd
< ;; Got answer:
< ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6677
< ;; flags: qr aa tc rd; QUERY: 1, ANSWER: 1, AUTHORITY: 13, ADDITIONAL:
< 13
<
< ;; QUESTION SECTION:
< ;www.forallthewaysyoucare.com.  IN      A
<
< ;; ANSWER SECTION:
< www.forallthewaysyoucare.com. 14400 IN  CNAME
< forallthewaysyoucare.com.awarenessnetworks.com.edgesuite.net.
<
< ;; AUTHORITY SECTION:
< .                       518400  IN      NS      a.root-servers.net.
< .                       518400  IN      NS      b.root-servers.net.
< .                       518400  IN      NS      c.root-servers.net.
< .                       518400  IN      NS      d.root-servers.net.
< .                       518400  IN      NS      e.root-servers.net.
< .                       518400  IN      NS      f.root-servers.net.
< .                       518400  IN      NS      g.root-servers.net.
< .                       518400  IN      NS      h.root-servers.net.
< .                       518400  IN      NS      i.root-servers.net.
< .                       518400  IN      NS      j.root-servers.net.
< .                       518400  IN      NS      k.root-servers.net.
< .                       518400  IN      NS      l.root-servers.net.
< .                       518400  IN      NS      m.root-servers.net.
<
< ;; ADDITIONAL SECTION:
< a.root-servers.net.     3600000 IN      A       198.41.0.4
< b.root-servers.net.     3600000 IN      A       192.228.79.201
< c.root-servers.net.     3600000 IN      A       192.33.4.12
< d.root-servers.net.     3600000 IN      A       128.8.10.90
< e.root-servers.net.     3600000 IN      A       192.203.230.10
< f.root-servers.net.     3600000 IN      A       192.5.5.241
< g.root-servers.net.     3600000 IN      A       192.112.36.4
< h.root-servers.net.     3600000 IN      A       128.63.2.53
< i.root-servers.net.     3600000 IN      A       192.36.148.17
< j.root-servers.net.     3600000 IN      A       192.58.128.30
< k.root-servers.net.     3600000 IN      A       193.0.14.129
<
< ;; Query time: 112 msec
< ;; SERVER: 216.21.232.10#53(216.21.232.10)
< ;; WHEN: Thu Feb 12 18:47:11 2009
< ;; MSG SIZE  rcvd: 512
<
< $
<
<
< And here's example, jumping straight to TCP to start with (notice it's
< all there, and no complaints about malformed or truncated):

I'd typically use +tcp here, but instead I used +vc, as that also works
on some older dig clients within the environment I was discussing,
which don't support the +tcp option.

< $ 2>&1 dig @216.21.232.10 -t A www.forallthewaysyoucare.com. +vc
<
< ; <<>> DiG 9.2.4 <<>> @216.21.232.10 -t A www.forallthewaysyoucare.com.
< +vc
< ;; global options:  printcmd
< ;; Got answer:
< ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48256
< ;; flags: qr rd; QUERY: 1, ANSWER: 1, AUTHORITY: 13, ADDITIONAL: 13
<
< ;; QUESTION SECTION:
< ;www.forallthewaysyoucare.com.  IN      A
<
< ;; ANSWER SECTION:
< www.forallthewaysyoucare.com. 14400 IN  CNAME
< forallthewaysyoucare.com.awarenessnetworks.com.edgesuite.net.
<
< ;; AUTHORITY SECTION:
< .                       518400  IN      NS      a.root-servers.net.
< .                       518400  IN      NS      b.root-servers.net.
< .                       518400  IN      NS      c.root-servers.net.
< .                       518400  IN      NS      d.root-servers.net.
< .                       518400  IN      NS      e.root-servers.net.
< .                       518400  IN      NS      f.root-servers.net.
< .                       518400  IN      NS      g.root-servers.net.
< .                       518400  IN      NS      h.root-servers.net.
< .                       518400  IN      NS      i.root-servers.net.
< .                       518400  IN      NS      j.root-servers.net.
< .                       518400  IN      NS      k.root-servers.net.
< .                       518400  IN      NS      l.root-servers.net.
< .                       518400  IN      NS      m.root-servers.net.
<
< ;; ADDITIONAL SECTION:
< a.root-servers.net.     3600000 IN      A       198.41.0.4
< b.root-servers.net.     3600000 IN      A       192.228.79.201
< c.root-servers.net.     3600000 IN      A       192.33.4.12
< d.root-servers.net.     3600000 IN      A       128.8.10.90
< e.root-servers.net.     3600000 IN      A       192.203.230.10
< f.root-servers.net.     3600000 IN      A       192.5.5.241
< g.root-servers.net.     3600000 IN      A       192.112.36.4
< h.root-servers.net.     3600000 IN      A       128.63.2.53
< i.root-servers.net.     3600000 IN      A       192.36.148.17
< j.root-servers.net.     3600000 IN      A       192.58.128.30
< k.root-servers.net.     3600000 IN      A       193.0.14.129
< l.root-servers.net.     3600000 IN      A       198.32.64.12
< m.root-servers.net.     3600000 IN      A       202.12.27.33
<
< ;; Query time: 219 msec
< ;; SERVER: 216.21.232.10#53(216.21.232.10)
< ;; WHEN: Thu Feb 12 18:49:33 2009
< ;; MSG SIZE  rcvd: 536
<
< $

Anyway, the above example set isn't for an NS record (or that and its
related A records), but similar applies for NS ... in terms of what
is/isn't "too many" NS servers.  If we look up the NS records (and are
also interested in the A records) for com., asking an authoritative
com. nameserver, we see:

$ dig @192.41.162.30 -t NS com.

; <<>> DiG 9.2.4 <<>> @192.41.162.30 -t NS com.
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8795
;; flags: qr aa rd; QUERY: 1, ANSWER: 13, AUTHORITY: 0, ADDITIONAL: 15

;; QUESTION SECTION:
;com.                           IN      NS

;; ANSWER SECTION:
com.                    172800  IN      NS      a.gtld-servers.net.
com.                    172800  IN      NS      d.gtld-servers.net.
com.                    172800  IN      NS      f.gtld-servers.net.
com.                    172800  IN      NS      k.gtld-servers.net.
com.                    172800  IN      NS      m.gtld-servers.net.
com.                    172800  IN      NS      h.gtld-servers.net.
com.                    172800  IN      NS      g.gtld-servers.net.
com.                    172800  IN      NS      l.gtld-servers.net.
com.                    172800  IN      NS      e.gtld-servers.net.
com.                    172800  IN      NS      b.gtld-servers.net.
com.                    172800  IN      NS      c.gtld-servers.net.
com.                    172800  IN      NS      i.gtld-servers.net.
com.                    172800  IN      NS      j.gtld-servers.net.

;; ADDITIONAL SECTION:
a.gtld-servers.net.     172800  IN      A       192.5.6.30
a.gtld-servers.net.     172800  IN      AAAA    2001:503:a83e::2:30
d.gtld-servers.net.     172800  IN      A       192.31.80.30
f.gtld-servers.net.     172800  IN      A       192.35.51.30
k.gtld-servers.net.     172800  IN      A       192.52.178.30
m.gtld-servers.net.     172800  IN      A       192.55.83.30
h.gtld-servers.net.     172800  IN      A       192.54.112.30
g.gtld-servers.net.     172800  IN      A       192.42.93.30
l.gtld-servers.net.     172800  IN      A       192.41.162.30
e.gtld-servers.net.     172800  IN      A       192.12.94.30
b.gtld-servers.net.     172800  IN      A       192.33.14.30
b.gtld-servers.net.     172800  IN      AAAA    2001:503:231d::2:30
c.gtld-servers.net.     172800  IN      A       192.26.92.30
i.gtld-servers.net.     172800  IN      A       192.43.172.30
j.gtld-servers.net.     172800  IN      A       192.48.79.30

;; Query time: 97 msec
;; SERVER: 192.41.162.30#53(192.41.162.30)
;; WHEN: Sun Mar  1 12:44:13 2009
;; MSG SIZE  rcvd: 509

$
Well, ... okay, the rules of the game are getting slightly more complex
with IPv6 also in the picture.  But back in ye ancient history (okay,
not too many years back), when . and com. weren't yet doing IPv6, 13
was the maximum number of NS servers that would all be guaranteed to
fit in a typical DNS NS UDP reply - hence they had (and actually still
do have) 13 nameservers.  If one has a really long domain name, perhaps
even seven might possibly be "too many" nameservers ... but for the
more typical cases, a minimum of 3 and maximum of 7 would be the
general best practice.
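
One can also check empirically whether a given reply still fits within
UDP: ask an authoritative server, suppress the TCP retry, and look at
the flags and message size.  E.g., reusing the com. query from above:

$ dig @192.41.162.30 -t NS com. +ignore | egrep 'flags:|MSG SIZE'

(+ignore tells dig to accept a truncated UDP reply rather than retrying
via TCP.)  If the reply had to be truncated, the flags line will
include tc, and the size will be capped around the classic 512-byte
limit.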

In the example further above, with www.forallthewaysyoucare.com., we
have as the answer the rather lengthy:
www.forallthewaysyoucare.com. 14400 IN  CNAME
forallthewaysyoucare.com.awarenessnetworks.com.edgesuite.net.
plus the reply also includes the query itself, which is also fairly
long:
www.forallthewaysyoucare.com. IN A
But the server also wants to try to be helpful, and since the CNAME is
to a different domain entirely, it wants to also tell us about those NS
records, and the corresponding A records for those.  Now, that NS and A
record data by itself would fit in a single UDP packet without
truncation, but with the not-so-short:
www.forallthewaysyoucare.com. 14400 IN  CNAME
forallthewaysyoucare.com.awarenessnetworks.com.edgesuite.net.
and query:
www.forallthewaysyoucare.com. IN A
also needing to be in that UDP reply packet, it can't all fit, so some
of it (the less critical listing of all the A records for the NS
servers) is truncated.

In that particular case, we not only had UDP-->TCP promotion (and that
all worked fine), but we also had a slightly surprising bit of firewall
excitement.  It turns out a firewall was also looking at the contents
of the UDP packet - analyzing the contents for being nice and
DNS-picture-perfect and fine and safe - decided it didn't like
something about the packet, and was dropping the response packet
entirely.  So we had a slightly funky intermittent bit of DNS failure
... things worked fine once the data made it through somehow and was
cached behind the firewall (e.g. if one asked independently for the
various constituent components), but that particular lookup was mostly
failing in general.

Another bit on DNS NS and redundancy, etc.  Once upon a time, a single
IP was (almost always, if not always) a single NS server ... that's no
longer necessarily the case.  In many cases, a single IP can have
behind it highly redundant routing, multiple redundant ISPs, and
multiple geographically distributed, highly available, redundant /
load-balancing servers.  So, yes, a minimum of 3 nameservers is still
best practice, ... but that's not necessarily 3 IPs ... a single IP can
have many nameservers behind it.  Note that some registrars still
require a minimum of 3 IPs, and that at least one, if not all, of them
be on distinct networks.  Other registrars only require a minimum of
two IPs, and they needn't be on distinct subnets/networks ... I'm not
sure, but there may be some registrars that will allow as few as just
one IP.

> Admittedly, sf-lug.com would not have been saved from downtime by a
> second slave, given that y'all failed to notice for a couple of weeks
> your master nameserver being offline, but achieving at least the
> recommended level of redundancy will save you from most other types of
> outages.
>
> I can offer you SVLUG's nameserver as a second slave.  NS1.SVLUG.ORG, IP
> 64.62.190.98.  Just add it to ns1.sf-lug.com's allowed-transfer ACL in
> /etc/bind/named.conf, restart BIND9, and let me know.  I'll set up slave
> nameservice and confirm that it can pull down zones and answer queries,
> and you then add it to the authoritative list.

For running production BIND, it's generally much safer to reload (and
then inspect logs for errors, and test to confirm) than to restart.  In
most cases with a reload, if there's a configuration error, things
won't grind to a screeching halt - BIND will generally continue to
serve the older data that was good before the configuration booboo was
made.  In the case of a restart with configuration error(s), BIND is
rather unforgiving - typically either refusing to start entirely, or
refusing to serve data for zones with configuration errors.
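
A sketch of that safer sequence, using BIND 9's usual companion
utilities (the zone file name here is made up - adjust paths to your
setup):

$ named-checkconf /etc/bind/named.conf
$ named-checkzone sf-lug.com. /etc/bind/db.sf-lug.com
$ rndc reload
$ dig @127.0.0.1 -t SOA sf-lug.com. +short

Only reload once the checks pass cleanly, and then confirm the server
still answers (and with the expected serial).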

> You really should not keep trying to get by with only two.  Bad idea.
> Really.

Jim - if you're interested, let me know - I can also point you at an
excellent free resource for DNS slave service that I found when I was
researching such for BALUG.  Also, since I believe there are some
physical moves and IP reassignments planned for sometime in the future
for the sf-lug.com. box, it might be simpler to defer adding slave(s)
until the master IP is quite long-term stabilized (changing the master
IP is a bit of work for all the slaves of that master ... it's
generally nice to make things as easy as feasible for one's slaves -
particularly if that's free donated service).  BALUG could probably
*also* offer some DNS slave services ... but we can probably find you
something of even higher availability than the BALUG host I'd have in
mind for that.

> Ideally, you should (and, well, I should, too) also set up a little
> cronjob, using "dig" to query the master and all slaves for the SOA
> record, and then parsing out the zonefile S/N using awk and making sure
> all nameservers are reporting the same value -- and sending admins
> e-mail if any of the machines isn't serving up data or isn't up-to-date.
> Running that daily would more than suffice.  Shouldn't be difficult.
>
> Why?  Because, as one learns the hard way, the guy who promised to do
> secondary nameservice for you a year ago will often forget, shut it off,
> and not bother to tell you.

Well, really, ideally :-) SF-LUG (and BALUG) should have appropriate
monitoring set up ... and much of that monitoring should exist on
systems *other* than those they're monitoring (so that, e.g., if
SF-LUG system A no longer has, say, its web or DNS or email service
available to the Internet, some other system can report on that ...
even if system A has lost all connectivity to the Internet).  Anyway,
that would be for monitoring Internet services.  That doesn't mean one
shouldn't *also* have monitoring on the system itself (e.g. security
and other items that may be better monitored locally).  On the local
monitoring bit, the reporting system may or may not be local (e.g. the
system that issues the alerts may be a distinct system).
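
Here's a minimal sketch of the sort of cronjob Rick describes - the
zone and servers are per this thread, but otherwise it's illustrative
rather than production-tested; it extracts each server's SOA serial
(third field of the SOA record) and complains on any absence or
mismatch:

#!/bin/sh
# check that master and slave(s) agree on the zone's serial
ZONE=sf-lug.com.
SERVERS='208.96.15.252 198.144.195.186'
serials=''
for s in $SERVERS; do
    serial=$(dig @"$s" -t SOA "$ZONE" +short +time=5 +tries=1 |
        awk '/^[^;]/ { print $3; exit }')
    [ -n "$serial" ] || echo "WARNING: no SOA answer from $s for $ZONE"
    serials="$serials $serial"
done
[ "$(echo $serials | tr ' ' '\n' | sort -u | wc -l)" -le 1 ] ||
    echo "WARNING: serial mismatch for $ZONE:$serials"

Run daily from cron; since cron mails any output to the job's owner,
the WARNING lines take care of the "e-mail the admins" part.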

> Actually, in fact, that aforementioned cronjob, to make it _really_
> useful, would also query either "whois" or the NS records inside the
> parent zone's records (in this case, the .com TLD zone's
> nameservers[1]), to find out whether the master and slaves are still
> authoritative.  Because the other mishap that keeps occurring is:  A
> year ago, you agreed to do secondary for a friend's domain, and have
> been doing it faithfully.  One day, it somehow occurs to you to reverify
> that your secondary service is still authoritative, and suddenly it's
> not:  You've been doing secondary pointlessly for an unknown number of
> months, because the bozo ceased using your service and failed to mention
> that fact to you.

Yes, it's good to monitor the registry bits ... but I'd treat that as a
rather distinct matter, as compared to DNS.  Sure, ... there's a bit of
overlap, ... but it can be the case that the registry data (as reported
by whois) is correct, yet the registry's DNS delegation isn't (that may
not be common, but thus far I've seen it once, where a TLD registry had
the nameservers correct in the whois data, but that data didn't match
the behavior of the authoritative nameservers ... even after much more
than sufficient time for any updating of the authoritative
nameservers).
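
Relatedly, a quick sketch of checking the parent's delegation against
the zone's own NS data (the parent .com servers are per the listing
further above):

$ dig @a.gtld-servers.net. -t NS sf-lug.com. +norecurse
$ dig @ns1.sf-lug.com. -t NS sf-lug.com.

If the NS set the parent hands out (in the authority section of the
first reply) differs from what the zone's own authoritative servers
claim, something's out of sync between registry and zone.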

> Last, Jim, have you considered graduating to something better than
> BIND9?  Seriously.  It's a dreadful piece of code:  slow, RAM-grabbing,
> overfeatured, and with a questionable security model.  It's no longer as
> scandalously buggy as BIND8 was, having gone through a total rewrite,
> but it's still scandalously bad.

Well, opinions on BIND9 will vary :-) ... but I certainly agree with at
least many of Rick's points about it.

> I realise you're, in part, trying to teach aspiring sysadmins how to
> wrangle the same terrible software they're likely to encounter in
> industry, but, now that you've done that for a while and learned the
> ropes, maybe you're ready to switch to something that doesn't suck.
>
> If ns1.sf-lug.com is doing only authoritative service, e.g., the
> machine's /etc/resolv.conf doesn't point to it for general ("recursive")
> nameservice, then look no further than NSD.  I can send you an example
> setup, as it's what we use on NS1.SVLUG.ORG.

At a quick glance, that would look rather feasible for the sf-lug.com. box:
$ hostname && stat /etc/resolv.conf | grep '^[MC]'
sf-lug
Modify: 2006-12-07 12:12:16.000000000 -0800
Change: 2006-12-07 12:12:16.000000000 -0800
$ cat /etc/resolv.conf
nameserver 64.81.79.2
nameserver 216.93.160.16
nameserver 216.93.170.17
search localdomain
$
On the other hand, there may be other factors to consider - e.g.: does
that distribution have an NSD package?  Teaching/training aspects of the
sf-lug.com. box?
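
For what it's worth, here's a hypothetical minimal nsd.conf sketch for
authoritative-only service on the master (NSD 3 syntax; the zone file
name is made up, the slave IP is per this thread):

zone:
        name: "sf-lug.com"
        zonefile: "sf-lug.com.zone"
        notify: 198.144.195.186 NOKEY
        provide-xfr: 198.144.195.186 NOKEY

NSD does authoritative service only - no recursion at all - which is
much of why its footprint and attack surface are comparatively small.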

> Date: Sat, 28 Feb 2009 16:56:46 -0800
> From: Rick Moen <rick at linuxmafia.com>
> Subject: Re: [sf-lug] balug.org DNS (was: sf-lug.com web site
>       accessibility)
> To: sf-lug at linuxmafia.com
>
> As an additional exercise, I'll now show you how I might check out
> whether balug.org DNS is OK.  This is something _anyone_ can do,
> even MS-Windows sufferers.  (There's a free-of-charge download of
> "whois" and "dig" programs in MS-Windows executable format on the Web.)

Perfectly valid analysis of the basic balug.org. and www.balug.org.
bits.  But as I alluded to earlier
(http://linuxmafia.com/pipermail/sf-lug/2009q1/006428.html)
the full picture of DNS for balug.org. gets a fair bit more complex ...
most notably we have various transition strategies in place ... so
we're still rather in a "between" state, ... though Rick well covered
the most noteworthy and visible "production" bits.

If one dig(1)s (okay, the pun wasn't initially intended, but boy does
it work ... especially when adding the (1)) a bit deeper, one may find
things fairly interesting in/around, oh, ... say balug.org. and
new.balug.org. and:
@ns1.balug.org.
@ns1.everydns.net.
@ns2.everydns.net.
@ns3.everydns.net.
@ns4.everydns.net.
@150.135.84.2
As to why :-) most of that is very well detailed in the applicable
named*.conf file and master zone files (and the RCS files, if one also
wants to know how we got to the present state).  A fair bit of the
higher-level bits and strategies, etc., was covered earlier on the
BALUG "admin" list
(http://lists.balug.org/listinfo.cgi/balug-admin-balug.org),
but the full DNS details (and most or all of their "why") are to be
found in the BALUG DNS configuration files on ns1.balug.org.

Other random note: everydns.net. - yeah, free DNS service and all that,
... but it's rather funky, so I wouldn't generally recommend it unless
one is well aware of its limitations and funkiness, and willing to put
up with them.  (In the case of BALUG it's "good enough" for at least
parts of a transitional strategy - we have other longer-term plans.
Also, one advantage to everydns.net: one can reconfigure it to repoint
to different master(s) without needing to bother a human administrator
of the slave(s) - that can be advantageous for a free service,
particularly if one anticipates doing one or more master IP transitions
in the future.)




