[conspire] What happens when 25% of your domain's nameservers give wrong answers
Rick Moen
rick at linuxmafia.com
Tue Mar 1 22:01:20 PST 2016
----- Forwarded message from Rick Moen <rick at linuxmafia.com> -----
Date: Tue, 1 Mar 2016 19:23:14 -0800
From: Rick Moen <rick at linuxmafia.com>
To: Mark Olson
Cc: Edie Stern, Joe Siclari
Subject: Re: Help me Obi-Wan!
Organization: If you lived here, you'd be $HOME already.
Quoting Mark Olson:
> Edie Stern and Joe Siclari (I think you know them -- they're
> long-time Florida fans who ran MagiCon and founded Fanac and now live
> in upstate NY -- I've copied them on this email)
/me waves.
> Basically, fanac.org seems to come and go, sometimes from minute to
> minute. It's an intermittent problem, experienced from multiple
> locations in the US at least. When it's gone, there's a "name not
> found" kind of error. A freebie monitoring subscription indicates
> that the site is never actually down -- it's always available from
> most of the endpoints, but it's down at any given endpoint about 5% of
> the time. Traces run when the site is not available don't help. They
> just show "name not found" kinds of errors.
Symptom as described strongly suggested a DNS problem -- and my thanks
for your care and precision in that description. You might not think
you're on top of the technology, but you folks have done a good job.
Basically, the Linneus Bottomus is that x7hosting is screwing up DNS
a bit -- on the DNS-transmitting end.
That's the executive summary. Details follow, further down.
> The hosting company has not been all that helpful. Their typical response
> is that this has to be a local issue - change your DNS. We use x7hosting.
They're trying to pass the buck and say it's on your DNS-receiving end,
and they're mistaken. The problem is in the authoritative DNS for
fanac.org.
By the way, further down, there will be references to carrierzone.com:
That appears to be the same company as x7hosting, and they use the two
domains (x7hosting.com and carrierzone.com) interchangeably. Don't be
confused by that.
Here is where I have to get into the gory details. Sorry. When you own
a domain (say, fanac.org as example) and wish to have it active on the
Internet, between 3 and 7 DNS nameservers should be denoted as
'authoritative nameservers' for fanac.org name information -- the
authoritative place other nameservers and the general public consult for
published fanac.org DNS name/IP address data (and other DNS data).
Other nameservers happily fetch, cache, and republish data fetched
directly or indirectly from the authoritative servers.
Clear so far?
A public Internet database called WHOIS is one of the two ways to look
up for any domain where its authoritative servers are. (I'll omit here
the other way, in the name of saving time.) Unix machines including OS X
have a utility called 'whois' to look that stuff up. For Windows
people, you can download a Zip bundle of whois.exe, dig.exe, and
host.exe as open-source tools.
Looking up fanac.org's authoritative nameservers and cutting out
irrelevant cruft:
$ whois fanac.org | more
[...]
Name Server: NS1.X7HOSTING.COM
Name Server: NS2.X7HOSTING.COM
Name Server: NS3.X7HOSTING.COM
Name Server: NS4.X7HOSTING.COM
[...]
$
Now, here's where we get something useful: I'm going to use the tool
'dig' to ask each of those four nameservers, in turn, for the 'soa' DNS
administrative record within the fanac.org domain.
$ dig -t soa fanac.org @NS1.X7HOSTING.COM +short
ns1.carrierzone.com. admin.carrierzone.com. 1007 86403 3600 3600000 86400
$ dig -t soa fanac.org @NS2.X7HOSTING.COM +short
ns1.carrierzone.com. admin.carrierzone.com. 1007 86403 3600 3600000 86400
$ dig -t soa fanac.org @NS3.X7HOSTING.COM +short
ns1.carrierzone.com. admin.carrierzone.com. 1007 86403 3600 3600000 86400
$ dig -t soa fanac.org @NS4.X7HOSTING.COM +short
$
Hmm, let's do that last one again without the '+short' flag, for maximum
detail:
$ dig -t soa fanac.org @NS4.X7HOSTING.COM
; <<>> DiG 9.8.3-P1 <<>> -t soa fanac.org @NS4.X7HOSTING.COM
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44922
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;fanac.org. IN SOA
;; Query time: 501 msec
;; SERVER: 64.49.104.96#53(64.49.104.96)
;; WHEN: Tue Mar 1 16:33:34 2016
;; MSG SIZE rcvd: 27
$
It's not refusing the query or saying the item asked about is
undefined. It's just giving a null answer -- which is a problem, in
that it's wrong and then becomes a cached wrong null answer at the user
end.
RESULT #1: That is my smoking gun of authoritative DNS problems: One
quarter of all public DNS queries about fanac.org are getting and
caching (retaining) an erroneous null answer.
Just to cross-check, let's ask all four about the very most common query
of all, the 'A' (forward-lookup) record for 'www.fanac.org':
$ dig -t a www.fanac.org @NS1.X7HOSTING.COM +short
64.29.145.9
$ dig -t a www.fanac.org @NS2.X7HOSTING.COM +short
64.29.145.9
$ dig -t a www.fanac.org @NS3.X7HOSTING.COM +short
64.29.145.9
$ dig -t a www.fanac.org @NS4.X7HOSTING.COM +short
$
Same (wrong) null result.
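
The cross-check above can be sketched in a few lines of Python. This is
only an illustration: the answers are copied by hand from the 'dig'
runs quoted in this message (an empty string stands for NS4's empty
reply), and the function name disagreeing_servers is made up for the
example, not part of any DNS tool.

```python
from collections import Counter

# Answers copied from the `dig -t a www.fanac.org @... +short` runs above.
# An empty string represents the empty reply received from NS4.
answers = {
    "NS1.X7HOSTING.COM": "64.29.145.9",
    "NS2.X7HOSTING.COM": "64.29.145.9",
    "NS3.X7HOSTING.COM": "64.29.145.9",
    "NS4.X7HOSTING.COM": "",
}


def disagreeing_servers(answers):
    """Return the servers whose answer differs from the most common non-empty one."""
    non_empty = [a for a in answers.values() if a]
    if not non_empty:
        # Nobody answered at all, so every server is suspect.
        return sorted(answers)
    consensus, _count = Counter(non_empty).most_common(1)[0]
    return sorted(name for name, a in answers.items() if a != consensus)


bad = disagreeing_servers(answers)
print(bad, f"{len(bad) / len(answers):.0%} of the authoritative servers")
```

With the four answers quoted above, the odd one out is NS4, i.e., one
server in four.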
I'm not actually totally sure how long that wrong result is typically
going to get cached. Normally, all DNS data from the authoritative
servers travels with a time stamp and an integer in seconds accompanying
the data, called 'TTL' = time to live. By convention, if the DNS datum
is older than TTL upon arrival, it is to be regarded as stale and not used.
Instead, a fresh query to the authoritative servers retrieves a new copy
which is then cached and used until its TTL also expires. This way,
most local DNS is served from cached data, rather than the user's local
nameserver having to always clobber the authoritative servers, but
nonetheless new data propagates quickly from those servers when
published.
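
To make the caching behaviour concrete, here is a minimal sketch of a
TTL-respecting cache. It is a toy, not how any real resolver is
written: the class TtlCache and the stand-in function
fake_authoritative are invented for illustration, time is passed in by
hand as plain seconds, and the 86400-second TTL on the address record
is an assumption (the message only shows that TTL on the SOA record).

```python
class TtlCache:
    """Toy cache: keep each answer until its TTL runs out, then refetch."""

    def __init__(self):
        self._store = {}  # name -> (value, expiry time in seconds)

    def get(self, name, now, fetch):
        entry = self._store.get(name)
        if entry is not None and now < entry[1]:
            # Still fresh: answer locally, no traffic to the authoritative servers.
            return entry[0], "cache"
        # Missing or stale: ask upstream and remember the answer for `ttl` seconds.
        value, ttl = fetch(name)
        self._store[name] = (value, now + ttl)
        return value, "authoritative"


def fake_authoritative(name):
    # Hypothetical stand-in for a real DNS query; returns (address, TTL).
    # The TTL of 86400 is assumed for illustration only.
    return "64.29.145.9", 86400


cache = TtlCache()
first = cache.get("www.fanac.org", now=0, fetch=fake_authoritative)
second = cache.get("www.fanac.org", now=3600, fetch=fake_authoritative)
third = cache.get("www.fanac.org", now=86400, fetch=fake_authoritative)
print(first, second, third)
```

The first lookup goes to the authoritative side, the second (an hour
later) is served from cache, and the third arrives just as the TTL has
run out, so it is fetched afresh.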
Maybe your aggregate 5% failure rate reflects the (wrong) null result
being usually kept not very long, even though 1/4 of all queries get it.
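
A back-of-envelope model of that guess, strictly as a toy: assume each
resolver picks one of the four servers at random on every fresh lookup,
holds a good answer for some period, holds the bad null answer for some
other period, and never retries a different server. None of those
assumptions is verified for real resolvers (many do retry elsewhere),
and the function names are made up for the example. Under that model,
the share of time spent holding the bad answer depends only on the
ratio of the two holding periods:

```python
def bad_fraction(p_bad, hold_bad, hold_good):
    """Long-run share of time a resolver holds the bad answer (toy model)."""
    bad_time = p_bad * hold_bad
    good_time = (1 - p_bad) * hold_good
    return bad_time / (bad_time + good_time)


def hold_ratio_for(observed, p_bad):
    """Solve bad_fraction == observed for the ratio hold_bad / hold_good."""
    return (observed / (1 - observed)) / (p_bad / (1 - p_bad))


# 1 server in 4 gives the null answer; monitoring reported about 5% failures.
ratio = hold_ratio_for(observed=0.05, p_bad=0.25)
print(round(ratio, 3))
```

In this model, a 5% failure rate with 1/4 of fresh lookups going wrong
would mean the null answer is held for only about 16% as long as a good
one, which is at least consistent with the guess above.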
RESULT #2: Several of the values x7hosting.com is publishing in the
critically important SOA (Start of Authority) record inside the DNS
zonefile (the set of DNS records for the domain) are way outside the
values recommended by Internet standards authorities, in a direction and
fashion that will significantly impair DNS reliability. However, the
only one of these that's really harmful is one called 'negative TTL',
which they've set ludicrously high. Details follow.
I would speculate that some of these misdefined values might have been
chosen to lower bandwidth costs they spend on published authoritative
DNS. But quite possibly this is just accidental error (Hanlon's Razor).
Again, sorry, but I need to go into gory details. The SOA record splits
into subfields, each of which is an important setting for the domain.
Here's the full SOA record from the first authoritative nameserver
(same as my original query, except without the +short flag). Omitting
+short results in very verbose output, so I'm going to cut the
irrelevant cruft from what's below:
$ dig -t soa fanac.org @NS1.X7HOSTING.COM
[...]
;; ANSWER SECTION:
fanac.org. 86400 IN SOA ns1.carrierzone.com. admin.carrierzone.com. 1007 86403 3600 3600000 86400
[...]
$
I'm going to reformat that answer to the pretty-printed standard layout
probably extant on-disk inside carrierzone/x7hosting's DNS nameservers'
master copy of the DNS zone -- appending some comments to the right-hand
side:
fanac.org.  1D IN SOA ns1.carrierzone.com. admin.carrierzone.com. (
                1007       ; serial number
                86403      ; refresh ~1D: 1200 to 43200 recommended
                3600       ; retry: typical 180 (3 min) to 900 (15 min)
                3600000    ; expire: 1209600 to 2419200 recommended
                86400      ; negative TTL: max 3 hrs (10800 secs)
                )
;
We'll concentrate on the indented subfields.
The subfield S/N's value '1007' is a bit eccentric, but we'll disregard.
I'm going to pass over refresh, retry, and expire quickly because all
involve how often and when the secondary nameservers refetch copies of
the zonefile from the primary (master) nameserver. As all of this
concerns communications _among_ x7hosting's DNS nameservers, it's really
their own affair, but suffice to say they are outside limits recommended
by published Internet standards -- in every case, in the direction of
less frequent communication among x7hosting's DNS nameservers, ergo less
cost for network traffic. Maybe a cheapskate measure, maybe just them
being odd.
Negative TTL: Ever notice that after your Web browser fails to resolve a DNS
name, attempts to refresh that tab instantly fail again for quite a few
minutes, and the browser obviously isn't trying to connect at all? That's negative
TTL, which is how long in seconds a negative (cannot resolve this thing)
result shall be regarded as valid for a domain. I.e., it governs the
persistence of negative responses.
Small time periods are strongly recommended for negative TTL - 15
minutes to 2 hours. Internet standards document RFC 2308 defines the
maximum recommended value for negative TTL to be 3 hours (10,800
seconds).[1]
x7hosting.com has defined it for fanac.org to be 86,400 seconds, which
is one 24-hour day, about 24 times any reasonable value and eight times
as large as the largest value recommended.
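
For anyone who wants to repeat this check on another domain, here is a
small sketch that splits a 'dig +short' SOA answer into its subfields
and compares them with the ranges quoted in the record comments above.
The names Soa, parse_soa, RECOMMENDED, and out_of_range are invented
for the example, and the lower bound of 0 for negative TTL is an
assumption, since only a maximum is quoted for that field.

```python
from typing import NamedTuple


class Soa(NamedTuple):
    mname: str         # primary nameserver
    rname: str         # admin mailbox, with '.' in place of '@'
    serial: int
    refresh: int
    retry: int
    expire: int
    negative_ttl: int


def parse_soa(short_answer: str) -> Soa:
    """Split one line of `dig -t soa ... +short` output into named fields."""
    mname, rname, *numbers = short_answer.split()
    return Soa(mname, rname, *(int(n) for n in numbers))


# Ranges in seconds, as quoted in the comments on the pretty-printed record.
# The lower bound for negative_ttl is assumed; only the maximum is quoted.
RECOMMENDED = {
    "refresh": (1200, 43200),
    "retry": (180, 900),
    "expire": (1209600, 2419200),
    "negative_ttl": (0, 10800),
}


def out_of_range(soa: Soa) -> dict:
    """Return the fields (and their values) that fall outside RECOMMENDED."""
    report = {}
    for field, (low, high) in RECOMMENDED.items():
        value = getattr(soa, field)
        if not low <= value <= high:
            report[field] = value
    return report


soa = parse_soa(
    "ns1.carrierzone.com. admin.carrierzone.com. 1007 86403 3600 3600000 86400"
)
problems = out_of_range(soa)
print(problems)
print(soa.negative_ttl / 10800)  # how many times over the 3-hour maximum
```

Fed the fanac.org record, it flags all four timing fields and puts the
negative TTL at eight times the three-hour maximum.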
The effect of excessive negative TTL is that failures to resolve the
domain's DNS, even transient ones caused by momentary hiccups, persist
for a long, long time and fail to go away. This is A Very Bad Thing.
SUMMARY - what to do about all this:
Well, gee, the cockeyed optimist's answer is: Tell x7hosting what it's
doing wrong, and wait for them to straighten up and fly right.
If you want to try that and it works, wonderful! I personally think
that's about as likely as Jeb Bush becoming the next US President.
The alternative: Have someone else besides x7hosting do the domain's
DNS. Note that this can be moved to elsewhere without moving 'hosting'
-- or, if you prefer, with it. These things are actually a la carte,
but many customers aren't aware of that fact.
[1] https://tools.ietf.org/html/rfc2308 It's at the end of section 5.
----- End forwarded message -----