[sf-lug] DNS: sf-lug.com., general, and balug.org. & miscellaneous, etc.

Thu Mar 26 04:18:43 PDT 2009

Re: DNS: sf-lug.com., general, and balug.org. & miscellaneous, etc.

> Date: Mon, 2 Mar 2009 02:15:50 -0800
> From: Rick Moen <rick at linuxmafia.com>
> Subject: Re: [sf-lug] DNS: sf-lug.com., general, and balug.org.
>
> Quoting Michael Paoli (Michael.Paoli at cal.berkeley.edu):
>
>> Why I jumped to checking queries against the slave
>
> Yeah, no problem there.  Actually, just about any starting point in
> finding the problem is fine; you just have to proceed systematically
> from there.  That's why you need to make sure you know where diagnostic
> "dig" queries _go_ (when dealing with situations like the recent one),
> rather than just omitting the "@" parameter and hoping for the best.

Yes, ... definitely more than one way to go about it ... key thing is
understanding DNS, how the queries and client tools (such as dig) work,
and logically working to isolate faults/problems or verify correctness,
depending on the scenario at hand.

I'll frequently jump in with dig without using @ ... but maybe only
about 40 to 60% of the time ... really much depends on the scenario.
Without @, and without using the +short option, dig does include in its
output, information on the IP from which it got its response (at least
if it got one).  Again, where I'll typically start much depends on the
scenario at hand, e.g.:

If the problem appears on "just" one client, or towards the client end
of things, I may often start without @, to see "just what the heck is
that client more-or-less doing by default, what's it getting, and from
where".  I'll commonly do "divide and conquer" from there, between
client(s) and authoritative nameserver(s) for the relevant data.

If the problem seems a bit more general - e.g. most stuff okay in a
zone/domain, but some data seems amiss, I'll more commonly start at, or
closer to, authoritative nameservers - and depending upon architecture
(is there a single originating master I can query, or do I not know the
details, and only know what the authoritative nameservers are?), I may
check multiple authoritative nameservers, or I might just start with
one, and then check some points between client(s) and authoritative
nameservers (e.g. are there intermediary nameservers involved)?

In some cases, problems may be further upstream - notably issues with
delegation of nameservers, so depending upon symptoms and such, it may
be quite appropriate to check for issues there.

In a new set-up (new nameservers and/or delegation, slaves, etc.) I'd
typically check end-to-end ... at least as much as feasible.  In a case
like that, where it may be rather to quite probable that it all works,
which end one starts from may not matter very much at all - except to
the extent that it can more quickly find the problems that are more
probable to exist - and that it's relatively quick and efficient to
verify everything that should be verified.
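
As a rough illustration of the "from where" part mentioned earlier:
without +short, dig prints a ";; SERVER:" line in its trailer naming the
address that answered.  Here's a minimal Python sketch that pulls that
address out of saved dig output.  The function name and the sample text
(which uses the documentation address 192.0.2.53) are made up for
illustration, and the exact trailer layout may vary a bit between dig
versions.

```python
import re
from typing import Optional

# Matches the trailer line dig prints, e.g.
#   ;; SERVER: 192.0.2.53#53(192.0.2.53)
# Newer dig versions may append the transport, e.g. "(UDP)"; anything
# after the port number is ignored here.
_SERVER_LINE = re.compile(
    r"^;; SERVER: (?P<addr>[^#\s]+)#(?P<port>\d+)", re.MULTILINE
)


def answering_server(dig_output: str) -> Optional[str]:
    """Return the address dig says it got its response from, or None."""
    match = _SERVER_LINE.search(dig_output)
    return match.group("addr") if match else None


# Hypothetical sample trailer, not captured from a real query.
SAMPLE = """\
;; Query time: 12 msec
;; SERVER: 192.0.2.53#53(192.0.2.53)
;; WHEN: Thu Mar 26 04:18:43 2009
;; MSG SIZE  rcvd: 71
"""

if __name__ == "__main__":
    print(answering_server(SAMPLE))
```

If that address isn't the server one meant to be testing, that's the cue
to repeat the query with an explicit @.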

And, when in doubt, test :-).  Until one's dealt with DNS
troubleshooting a lot, it may not always work quite the way one might
expect ... and even with lots of experience, still, on occasion,
problems will show up in "interesting" places (e.g. a firewall that
worked with both UDP and TCP for DNS as it generally should ... except
there were some UDP packets it didn't like, and dropped on the floor ...
which broke DNS in a somewhat mysterious way (a much less common
firewall DNS booboo); or the case of a registrar that had the correct
delegation data in whois, but had precisely none of that NS data in
their authoritative nameservers) - but those are the exceptions (hit
each of those just once so far), rather than the rule (e.g. other
firewall misconfigurations - such as allowing UDP but not TCP - are far
more common problems).
Sometimes one may even need to dig down into the bits - e.g. tcpdump
(that's what quickly provided the smoking gun in that oddball firewall
problem case).
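
For the more common "UDP works but TCP doesn't" firewall case, one
approach is to ask the same question of the same server over each
transport and compare.  The Python sketch below only builds the two dig
command lines (it does not run them); the helper names are hypothetical,
and it assumes a dig that understands +tcp, +notcp, +ignore and
+norecurse.

```python
from typing import Dict, List


def dig_command(name: str, rtype: str, server: str, tcp: bool) -> List[str]:
    """Build a dig command line that pins both the server and the transport."""
    cmd = ["dig", f"@{server}", name, rtype, "+norecurse"]
    if tcp:
        cmd.append("+tcp")
    else:
        # +ignore keeps dig from quietly retrying over TCP when the UDP
        # answer comes back truncated, which would hide the problem.
        cmd.extend(["+notcp", "+ignore"])
    return cmd


def transport_checks(name: str, rtype: str, server: str) -> Dict[str, List[str]]:
    """Return the UDP and TCP variants of the same query, keyed by transport."""
    return {
        "udp": dig_command(name, rtype, server, tcp=False),
        "tcp": dig_command(name, rtype, server, tcp=True),
    }


if __name__ == "__main__":
    for transport, cmd in transport_checks(
        "balug.org.", "NS", "ns1.balug.org."
    ).items():
        print(transport, " ".join(cmd))
```

Each list can be handed to subprocess.run() as-is; if one transport
answers and the other times out, a firewall in the path is a good first
suspect.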

> Something Jim Dennis talks about, in his lectures on system
> administration, is that the concepts of unit testing from extreme
> programming are exactly what a sysadmin needs, to do each job right the
> first time.  That is, you need to include in the planning and execution
> of each task the thinking out and execution of a suitable means to test
> the thing you're doing.  Testing should be _integral_ to each task.
>
> I used to tell my staff at Linuxcare a variation on that concept:  I'd
> tell them "Your task isn't done until it's tested."  If the task was to
> set up a piece of software, then the task isn't done until you've made
> that software perform its function.  If the task includes making
> software start at boot time, then you need to schedule a reboot to test
> your assumption that everything will work OK during startup.  (Even on
> production servers, you can work in a planned reboot _sometime_.  It's
> better than finding out only during _unplanned_ reboots whether startup
> is OK.)

Yes, ... in many cases, my notifications that something's been done are
like:
"
done:
$ <command(s) that show I tested it>
<output that shows it worked as expected>
...
"
... or references to item(s) that have the details (e.g. a change record
that has test results and a sign-off that the change was verified
(tested) as successful).
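
That sort of notification is easy enough to generate mechanically.  A
minimal Python sketch, assuming the test is a single command whose exit
status says whether it passed; done_report() is a made-up name, and the
command in the example is just a stand-in for a real test.

```python
import shlex
import subprocess
import sys
from typing import List


def done_report(cmd: List[str]) -> str:
    """Run cmd and return a note showing the command and what it printed."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Only claim "done" if the test command actually succeeded.
    heading = "done:" if result.returncode == 0 else "NOT done:"
    lines = [heading, "$ " + shlex.join(cmd)]
    output = (result.stdout + result.stderr).rstrip("\n")
    if output:
        lines.append(output)
    if result.returncode != 0:
        lines.append(f"(exit status {result.returncode})")
    return "\n".join(lines)


if __name__ == "__main__":
    # Stand-in for a real test command; it just runs this Python interpreter.
    print(done_report([sys.executable, "-c", "print('ok')"]))
```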

The reboots may not be frequent, ... but yes, ... sooner or later.  Good
to also keep close track of what's to be tested at that reboot (be it
scheduled,
or be it a "drats, it's down anyway and we've failed over to our other
system anyway, so, as long as we can take a wee bit of time bringing it
up, let's also complete that test on ...").  So, ... those lists of "to
be checked at (re)boot" (and "maintenance to be done when system brought
down", etc.) do come in quite handy.

>> Maximum ... well, depends, in certain cases that's as high as, but no
>> higher than 13.
>
> "As high as"?  Fsck no.
> nameservers, and seven is the practical limit -- and the _recommended_
> limit -- for almost all situations.  (The root nameservers are an
> anomalous case, for reasons I'd rather not get into.)

Anomalies can be fun.  :-)  For 13, I was specifically thinking root
nameservers, and TLD DNS servers, for quite short TLDs (like two letter
countries and com. - even museum. is likely too long for 13).  But yes,
13 is a special case, and the vast overwhelming majority of us will
never be doing DNS administration of those special cases.

>> So, ... what really bad happens with too many?  The complete DNS
>> response isn't guaranteed to all fit within a single UDP packet.
>
> That is only the beginning of the problems, and the simplest and most
> mechanistic problem, one is likely to have.  Do I _really_ need to get
> into that?

Yes, yes, too many is very bad.  For non-anomalous (i.e. most all)
scenarios, seven, as you pointed out, is the practical and recommended
limit for almost all situations.
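
For the purely mechanistic part (packet size), a back-of-the-envelope
calculation shows how the bytes add up.  The Python sketch below
estimates the size of an NS response under some loud assumptions: every
nameserver name differs only in a first label of the same length (as
with a.root-servers.net. through m.root-servers.net.), the rest of each
later name is compressed to a 2-byte pointer, one A record per
nameserver goes in the additional section, and there are no AAAA records
or EDNS0.  The function names are mine, and real responses will differ.

```python
HEADER = 12    # fixed DNS header
FIXED_RR = 10  # type (2) + class (2) + TTL (4) + RDLENGTH (2)
POINTER = 2    # compression pointer
IPV4 = 4       # RDATA of an A record


def wire_len(name: str) -> int:
    """Uncompressed wire length: one length byte per label, plus the root byte."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    return sum(1 + len(label) for label in labels) + 1


def ns_response_size(zone: str, first_ns: str, count: int) -> int:
    """Estimate bytes in a response listing `count` NS records plus A glue."""
    if count < 1:
        raise ValueError("count must be at least 1")
    question = wire_len(zone) + 4  # name + QTYPE + QCLASS
    # The root name is a single byte; any other owner name is assumed to be
    # a pointer back to the name in the question section.
    owner = 1 if wire_len(zone) == 1 else POINTER
    first_rdata = wire_len(first_ns)  # first NS name written out in full
    later_rdata = 1 + len(first_ns.split(".")[0]) + POINTER
    ns_section = (owner + FIXED_RR + first_rdata) + (count - 1) * (
        owner + FIXED_RR + later_rdata
    )
    additional = count * (POINTER + FIXED_RR + IPV4)
    return HEADER + question + ns_section + additional


if __name__ == "__main__":
    print(ns_response_size(".", "a.root-servers.net.", 13))
```

Under those assumptions the 13 root nameservers come to 436 bytes,
inside the classic 512-byte limit for DNS over UDP without EDNS0.
Nameserver names that don't compress as well, or a longer name in the
question, use up the remaining room.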

> It's entirely and spectacularly irrelevant to my point whether Jim, in
> the previously discussed situation, does "rndc reload" or "service named
> restart".  I.e., you are wandering off onto a typical
> obsessive-compulsive geek irrelevancy.  However, that being said, in
> Jim's shoes, if I had a fatal configuration error within that piece of
> cr__ BIND9's conffiles, I'd much rather know sooner than later.

Sure, ... in a case like sf-lug.com. where high availability of the
nameserver over a relatively short interval isn't crucial, a restart is
fine.

In cases, however, where high availability of the nameserver is very
important (e.g. it's getting pounded, and when it goes down, many
clients will experience undesirable latencies in resolving DNS),
doing a reload keeps the service up and running - even in most all cases
where there's a (typical zone) configuration error - and that an error
has occurred, is generally rather easily seen by checking the logs
and/or noting that in such cases, the nameserver is still generally
serving the prior zone file's data (older serial number), rather than
the newer serial number and zone data in the flawed zone file.  In
rather to quite critical production scenarios, this can be the much more
graceful way to find (and then correct) a zone file error.
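
The "still serving the older serial number" check is simple to script.
A minimal Python sketch, assuming one has a line of "dig +short SOA"
output from the server in question and knows the serial number in the
edited zone file; the function names and the example.org values are
made up for illustration.

```python
def soa_serial(soa_short: str) -> int:
    """Extract the serial from one line of `dig +short SOA` output.

    The fields are: MNAME RNAME SERIAL REFRESH RETRY EXPIRE MINIMUM
    """
    fields = soa_short.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 SOA fields, got {len(fields)}")
    return int(fields[2])


def still_serving_old_zone(served_soa: str, zone_file_serial: int) -> bool:
    """True if the server's serial differs from the one in the zone file."""
    return soa_serial(served_soa) != zone_file_serial


if __name__ == "__main__":
    # Hypothetical values: the zone file was bumped to 2009032601, but the
    # server still answers with yesterday's serial.
    served = "ns1.example.org. hostmaster.example.org. 2009032501 10800 3600 604800 3600"
    print(still_serving_old_zone(served, 2009032601))
```

A True result right after a reload is the hint to go read the logs for
the zone file error.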

>> Jim - if you're interested, let me know - I can also point you at an
>> excellent free resource for DNS slave that I found when I was
>> researching such for BALUG.
>
> I hope you're not yet another person pushing EveryDNS, with its broken
> djbware-based implementation that doesn't support AXFR and ignores
> NOTIFY.

Ewww, egad no.  Friends don't let friends recommend EveryDNS ;-)  (Okay,
*maybe* I wouldn't say it's beyond recommendation for *all* scenarios,
but it sucks very seriously in many major ways, and I wouldn't generally
recommend it.)  In the case of BALUG.ORG., yes, we're using it, but NOT
for any (critical) production - basically if EveryDNS blew up or started
serving totally bogus data (some would argue it already does that ;-))
for BALUG.ORG., it would have negligible impact on BALUG operations
(e.g. it wouldn't impact the main webserver ([www.]balug.org.), the
lists, or the list archives) - what little "damage" it could do would
at most inconvenience some BALUG webmaster(s) that aren't also full
BALUG systems administrators (a small and not very active in that
capacity set) ... so, ... no biggie there.  Longer term we plan to move
BALUG fully off of EveryDNS ... likely after some of our DNS dust
settles (most notably IPs of master(s) - at present all three in, or
candidates for, that role will very likely have their IPs change later
this year).

> I respect Ulevitch and crew, but losing the ability to have timely
> updates to secondaries is a pretty sad disadvantage, especially given
> that any number of people in the Valley will be glad to give you
> more-competent secondary service for free, too.

EveryDNS has *many* issues, ... last time I bothered to note them in
detail (from a zone file of a subdomain of BALUG.ORG., e.g.
test.balug.org.):
; For EveryDNS.net, NOTE AT LEAST THE FOLLOWING (at least as of
; 2007-05-26):
;
; Cannot be set up and function as slave(s) until one or more of
; their nameservers have been delegated as an NS via a chain of
; authority from the root nameservers.
;
; WILL NOT ACCEPT/LOAD ALL VALID RR TYPES (is not running BIND)
;
; LIMITS TO 200 THE TOTAL NUMBER OF RECORDS it will load for free for
; any given account or domain.
;
; ENFORCES CERTAIN MINIMUM TTLs.
;
; FAILs TO ACCEPT TCP CONNECTIONS on ns[123].everydns.net.  Among
; other things this would likely impact larger records and responses
; to queries where all the data (e.g. multiple records) could not fit
; within a single UDP reply.
;
; DOES WORK (including answering queries) with TCP on:
; ns4.everydns.net.
;
; Accepts but ignores notify.
;
; Will pick up zone updates at most once per hour (and presumably
; only if SOA serial number indicates there's been an update).
;
; Some of these items can be checked with:
; http://www.dnsreport.com/
; and/or other tools.
;
; For more information, also check under:
; http://www.everydns.net/

What I had in mind to recommend to Jim:
o a relatively high availability (HA) slave (one main DNS IP on HA
   cluster in data center)
o not dependent or mostly dependent upon one single person for its
   continued operation (and updates, maintenance, configuration changes,
   etc.)
o is actually operated as part of a (college/university) faculty
   sponsored CS student group - DNS being one of the services they
   maintain on the HA cluster system.
o free

There are many other good/excellent possibilities out there too, but the
above is the one I specifically had in mind, anyway.

>> Yes, good to monitor registry bits ... but I'd treat that as a rather
>> distinct matter, as compared to DNS.
>
> Um, excuse me?  The identity of which IPs are authoritative is "a
> distinct matter from DNS"?  In what universe?
>
>> but thus far I've seen it once where a TLD registry had
>> the nameservers correct in whois data, but that data didn't match
>> behavior of the authoritative nameservers
>
> That's why you check the glue records in the parent zone, genius.

Registry whois data about DNS *should* be consistent with DNS - but thus
far I've encountered one case where it wasn't - and no, not a matter of
glue records at all (the nameservers were in a different domain
altogether).  In that one case I ran into (some two letter country TLD
... I don't recall what country at this point), the registrar whois
data about our DNS was all correct - and had been correct for quite a
while - but the actual DNS was another story.  Most notably, the NS
records for the nameservers were missing from the authoritative
nameservers for the two letter country TLD (a slight bit of prodding
to the registrar pointing this out, and they corrected it).
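
Comparing the two views is easy to automate once one has both lists in
hand (say, the nameservers listed in whois, and the NS records returned
by something like "dig +norecurse NS <zone> @<parent nameserver>").  A
minimal Python sketch; the function names and the example.net hosts are
hypothetical.

```python
from typing import Dict, Iterable, Set


def normalize(names: Iterable[str]) -> Set[str]:
    """Lowercase names and drop any trailing dot so the two lists compare cleanly."""
    return {n.strip().rstrip(".").lower() for n in names if n.strip()}


def delegation_diff(
    whois_ns: Iterable[str], parent_ns: Iterable[str]
) -> Dict[str, Set[str]]:
    """Report nameservers that appear in only one of the two sources."""
    in_whois = normalize(whois_ns)
    in_dns = normalize(parent_ns)
    return {
        "missing_from_dns": in_whois - in_dns,
        "missing_from_whois": in_dns - in_whois,
    }


if __name__ == "__main__":
    # The case described above: whois is right, the parent zone has nothing.
    print(delegation_diff(["NS1.Example.NET.", "ns2.example.net"], []))
```

Two empty sets mean whois and the parent zone agree; anything else is
worth a note to the registrar.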

>> Well, opinions on BIND9 will vary :-) ... but I certainly agree with at
>> least many of Rick's points about it.
>
> _Many_?  Are you prepared to seriously assert that...

I didn't say that opinions varied much or at all on individual
particular aspects of BIND9 ;-)

For the various (mis-)features and capabilities one wants or requires,
and/or doesn't want, BIND9 and alternatives (e.g. NSD), will each fit
well, better, or not at all, in different scenarios.

And thanks also for providing lots of good information on NSD.

>> If one dig(1)s (okay, pun wasn't initially intended, but boy it works
>> ... especially when adding (1)) a bit deeper, one may find things fairly
>> interesting in/around, oh, ... say around balug.org. and new.balug.org. and
>> @ns1.balug.org.
>> @ns1.everydns.net.
>> @ns2.everydns.net.
>> @ns3.everydns.net.
>> @ns4.everydns.net.
>> @150.135.84.2
>
> Ugh.  Broken EveryDNS secondary nameservice.  Ignores NOTIFY, doesn't do
> AXFR.  Avoid.

Yes, yes, EveryDNS deficiencies (and abominations) duly noted.
If one just pokes around, e.g.:
balug.org.
www.balug.org.
lists.balug.org.
one will however find everydns.net nowhere to be found.