[sf-lug] resolver problem

Rick Moen rick at linuxmafia.com
Fri Apr 8 13:03:26 PDT 2016

Quoting Michael Paoli (Michael.Paoli at cal.berkeley.edu):

> Really quite up to you.
> I think it likely at least some folk(s) will continue to be interested.
> You could always and/or additionally, e.g.:
> bring the laptop to a SF-LUG meeting (does it have same issues when
> on other networks?),
> bring the laptop to a CABAL meeting (potentially even fair bit more
> time to dig to bottom of odd problems),
> ask on some other list(s)/forum(s) (different and/or additional folks
> might come up with something useful - even "answer" - that we'd not
> thought of ... and be sure to include relevant details/history if
> you do so ask, to avoid unnecessary redundant work).

> I'm guessing at this point something is likely messed up on the
> installation/configuration ... perhaps a library, or some quirky
> environmental bit ... who knows.

I agree.  More below.

> Differences in behavior with
> strace data between working and not, with same command (e.g. host &
> same hardware booted from DVD), likely has an "answer", or substantial
> clues *somewhere* in there ... but strace can be quite a bit of data
> to go through (grep and family can be good friends).

I certainly concur.  Having served my time with tools like strace (logs
a process's system calls), ltrace (logs a process's library calls), and
tcpdump (logs all network activity) that generate huge mountains of
logging data, I can say that sometimes they're a useful last resort for
puzzling problems.  They cast a wide net to collect data, and you need
to spend some time learning how to use those tools _and_ how to
interpret and sift through their prodigious amounts of cryptic output.
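
To make the sifting step concrete, here is a minimal Python sketch that
boils an strace log down to just the calls that failed, counted per
syscall and errno.  It assumes the log was saved with something like
'strace -f -o trace.log wget ...' and that failed calls appear in the
usual '= -1 ENOENT (...)' form; the file name trace.log and the helper
names are made up for illustration.

```python
import re
import sys
from collections import Counter

# Matches a failed call such as:
#   openat(AT_FDCWD, "/etc/foo", O_RDONLY) = -1 ENOENT (No such file or directory)
# An optional "[pid 1234] " or "1234 " prefix (as produced by strace -f) is tolerated.
FAILED_CALL = re.compile(
    r'^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(.*\)\s+= -1 (E[A-Z0-9]+)'
)


def failed_syscalls(lines):
    """Yield (syscall, errno_name, line) for every call that returned -1."""
    for line in lines:
        match = FAILED_CALL.match(line)
        if match:
            yield match.group(1), match.group(2), line.rstrip("\n")


def summarize(lines):
    """Count failures per (syscall, errno) pair, most frequent first."""
    counts = Counter((name, err) for name, err, _ in failed_syscalls(lines))
    return counts.most_common()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python3 failed_calls.py trace.log
    with open(sys.argv[1], errors="replace") as log:
        for (name, err), count in summarize(log):
            print(f"{count:6d}  {name}  {err}")
```

Running the same summary on a trace from the broken install and on one
from the working DVD boot, then comparing the two, is usually a lot
less painful than reading either log raw.  Plenty of failed calls are
normal (programs probe for files that legitimately don't exist), so
it's the differences that matter.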

Alex, let's take a moment to consider the question of what _could_ be
the root cause of this situation.

Now, I'll admit I'm actually a tiny bit lost after about 40 messages.
There were some messages where it appeared that the problem was
'ThinkPad's ethernet and/or wireless ports aren't getting DHCP leases.'
There is now what looks to be an unrelated(?) set where the problem is 
'Some Internet client applications cannot get DNS resolution, but others
can.'  So, I'm left a little at sea.  Are we talking about two
_different_ simultaneous problems?  And, if so, how can you even be
getting to 'apps cannot do DNS' if your network port isn't getting an IP
address?  Is this two different computers?

But I'll set that concern aside and return to the 'apps cannot do DNS'
problem.  What you described was:  Everything used to work, but then 
some unknown badness event happened.  After the event, _some_ Internet
clients (Firefox, wget, etc.) cannot get the DNS resolution they need.
Others (such as Chromium) can.

One of the rules of diagnosis is:  Distrust coincidences, as they're
rare.  So, it's really unlikely that all of the Internet apps that 
stopped being able to get DNS resolution got simultaneous and separate
breakage to their individual codebases.  Right?  That would be
meteor-hitting-the-point-of-the-Transamerica-Pyramid freaky, so we
dismiss the possibility.  It must be a common system resource that most
of those apps rely on, but that a few (Chromium and whatever few others)
do not.
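
One cheap way to poke at that shared resource directly, without any
particular app in the way, is to ask the C library's resolver for a
name yourself.  The Python sketch below does that: socket.getaddrinfo()
hands the lookup to the C library's getaddrinfo(), so it exercises the
same system path most ordinary clients use.  The function name is
invented for illustration, and the hostname is whatever you pass on the
command line.

```python
import socket
import sys


def resolve_via_libc(hostname):
    """Resolve hostname through the C library's getaddrinfo().

    Returns (True, sorted list of addresses) on success,
    or (False, error message) if the lookup fails.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror as exc:
        return False, str(exc)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return True, sorted({info[4][0] for info in infos})


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 resolve_check.py www.example.com
    ok, detail = resolve_via_libc(sys.argv[1])
    print("resolved:" if ok else "FAILED:", detail)
```

If this fails for a name that Chromium loads happily, that's consistent
with the breakage being in the shared system path rather than in the
individual apps.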

Michael and I have alluded to the system resources that do DNS.  There
are a number of conffiles in the /etc tree that control the behaviour of
name resolution:  /etc/resolv.conf, /etc/nsswitch.conf, /etc/hosts, the
/etc/pam.d/ tree, /etc/host.conf, and probably others I'm not
remembering.  The functionality those files regulate is in various
system libraries (binaries indirectly rather than directly executed),
including the system DNS resolver library libresolv, shipped as part of
GNU libc = glibc.  And a bunch of others that interact with that library
and others and with the system TCP/IP stack generally (and thus the
running kernel), including for example the PAM libraries.
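
For a quick look at what two of those conffiles currently say, here is
a small Python sketch that pulls the nameserver entries out of
/etc/resolv.conf and the 'hosts:' lookup order out of
/etc/nsswitch.conf.  It handles only the common line formats, and the
function names are invented for illustration.

```python
def nameservers(resolv_conf_text):
    """Return the addresses on 'nameserver' lines, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":   # blank line or comment
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def hosts_sources(nsswitch_text):
    """Return the entries on the 'hosts:' line, or [] if there is none."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line.startswith("hosts:"):
            return line[len("hosts:"):].split()
    return []


if __name__ == "__main__":
    for path, parse in (("/etc/resolv.conf", nameservers),
                        ("/etc/nsswitch.conf", hosts_sources)):
        try:
            with open(path) as conf:
                print(path, "->", parse(conf.read()))
        except OSError as exc:
            print(path, "-> cannot read:", exc)
```

An empty nameserver list, or a 'hosts:' line with no 'dns' entry, would
be worth a closer look before going anywhere near strace.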

All of the conffiles and libraries I refer to in the prior paragraph are
_system_ files and therefore protected against tampering by
restricted file ownership and permissions -- which is to say that you as
an unprivileged user cannot damage them by accident.  They get put into
their initial running state by the Linux distribution (e.g., Ubuntu)
installer unpacking them from software packages and placing them in
system directory trees.  You would -- of course -- normally leave them
alone, with occasional exceptions such as adding and removing local
static name-and-IP mappings in /etc/hosts, adjusting which DNS
nameserver IPs your system will use in /etc/resolv.conf, and so on.
But, otherwise, you normally leave them alone, and they Just Work.
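
If you want to confirm that the ownership-and-permissions protection is
still intact, a sketch like the following flags files that are not
owned by root or that non-root users could write to.  The list of paths
is just the conffiles named above; the function name and the exact set
of checks are assumptions for illustration, not an exhaustive audit.

```python
import os
import stat


def permission_problems(path):
    """Return reasons why `path` looks writable by someone other than root."""
    st = os.stat(path)            # follows symlinks, e.g. for resolv.conf
    problems = []
    if st.st_uid != 0:
        problems.append(f"owned by uid {st.st_uid}, not root")
    if st.st_mode & stat.S_IWOTH:
        problems.append("world-writable")
    if st.st_mode & stat.S_IWGRP and st.st_gid != 0:
        problems.append(f"group-writable by gid {st.st_gid}")
    return problems


if __name__ == "__main__":
    for path in ("/etc/resolv.conf", "/etc/nsswitch.conf",
                 "/etc/hosts", "/etc/host.conf"):
        try:
            issues = permission_problems(path)
        except OSError as exc:
            print(f"{path}: cannot stat ({exc})")
            continue
        print(f"{path}: {', '.join(issues) if issues else 'looks fine'}")
```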

Given that you sensibly left them alone, what then might have happened?

Bobbie referred in passing to something that _could_ have happened,
though it's quite speculative.  Hypothetically, if your hard drive is
starting to fail and developing bad patches, one of those developing bad
patches might be right where a key system library lives.  Likely
possibility?  No, not hardly.  But it _could_ happen.

Also, as I suggested upthread, as a Unix user with ability to wield
superuser access (i.e., a system administrator), the most likely threat
to your system is actually you.  (I'm emphatically not excluding myself.)
Every time you su to a root shell (which you probably do not do, as that
is not the 'Ubuntu Way'), and every time you carry out a privileged
operation using sudo (which _is_ the 'Ubuntu Way', and tends IMO to be
carried out with frightening casual inattention by many users), you 
are wielding powers that can and sometimes do clobber, cripple, mangle
(etc., pick your metaphor) your system in an instant.

Did you accidentally clobber something when wielding system privilege?
It's obviously pretty difficult to say after the fact, though I suppose
you could look through the sudo logs.
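
As a rough illustration of looking through those logs: sudo records
each command in a line containing 'USER=... ; COMMAND=...', and the
Python sketch below pulls those out.  It assumes the entries land in
/var/log/auth.log (typical for Ubuntu, but an assumption; your system
may log elsewhere, and older entries may sit in rotated files), and the
function name is invented.

```python
import re
import sys

# Matches the part of a syslog line that sudo writes, e.g.:
#   sudo:     alex : TTY=pts/0 ; PWD=/home/alex ; USER=root ; COMMAND=/bin/ls
SUDO_LINE = re.compile(
    r'sudo:\s+(?P<user>\S+) : .*?USER=(?P<runas>\S+) ; COMMAND=(?P<command>.*)$'
)


def sudo_commands(lines):
    """Yield (invoking user, target user, command) for each sudo command line."""
    for line in lines:
        match = SUDO_LINE.search(line)
        if match:
            yield (match.group("user"), match.group("runas"),
                   match.group("command").strip())


if __name__ == "__main__":
    # The default path is an assumption; pass another file as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/auth.log"
    try:
        with open(path, errors="replace") as log:
            for user, runas, command in sudo_commands(log):
                print(f"{user} as {runas}: {command}")
    except OSError as exc:
        print(f"cannot read {path}: {exc}")
```

Reading the log usually needs privilege itself, so expect to run this
with sudo (carefully).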

Of course, also every time you do system maintenance, adding and
removing packages, etc., you are running programs that carry out
privileged operations on your behalf.  If you were, say, updating
system packages using apt-get, aptitude, synaptic, or whatever's
fashionable for package operations these days, and the writing to system
directories of, say, a key PAM library went cockeyed because of a random
system glitch at an unfortunate moment, the package update _could_ have
silently damaged your installation.  Likely?  Not as such, but perhaps a
little more likely than a bad sector on your hard disk.
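
Both the bad-sector and the botched-update theories can be tested the
same way: compare the files on disk against the checksums the packages
shipped with.  On Debian-family systems, dpkg keeps per-package lists
under /var/lib/dpkg/info/ in files ending in .md5sums, one
'checksum  relative/path' entry per line.  The Python sketch below
checks one such list; the function names are invented, and it is only a
stand-in for purpose-built tools such as debsums or 'dpkg --verify'.

```python
import hashlib
import os
import sys


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_md5sums(md5sums_path, root="/"):
    """Yield (relative path, status) per entry: 'ok', 'changed' or 'missing'."""
    with open(md5sums_path) as listing:
        for line in listing:
            line = line.rstrip("\n")
            if not line:
                continue
            expected, relative = line.split(None, 1)
            full_path = os.path.join(root, relative)
            if not os.path.isfile(full_path):
                yield relative, "missing"
            elif md5_of(full_path) != expected:
                yield relative, "changed"
            else:
                yield relative, "ok"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 verify_pkg.py /var/lib/dpkg/info/<package>.md5sums
    for relative, status in verify_md5sums(sys.argv[1]):
        if status != "ok":
            print(f"{status:8s} /{relative}")
```

Conffiles under /etc generally aren't in those lists, so this covers
the libraries and binaries, not the configuration.  A 'changed' or
'missing' library would be a strong hint about which package to
reinstall.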

So:  You could collect more data by learning to use strace and giving
Michael a huge file of collected syscalls to sift through.  Even though
that's kind of a Hail Mary Pass where you're saying 'I'm unsure what's
going on, so I'm going to see if it's a particular syscall failing in
hopes of isolating it to a broken system facility', this could permit
Michael to say 'Oh, I see it now.  Somehow your [foo] library got
broken.  You should just force-reinstall the [foo] package.'  Happy

Or, you could decide to end the expenditure of time and say to yourself,
'Self, I have high confidence that this system self-mutilation,
however it originated, is in no way a recurring thing but rather a
one-time freak occurrence, and I care less about finding the smoking gun
than I do about making the problem go away.  So, fsck it, I'm going to
just reinstall.'

Ordinarily, I'm in the front ranks of Unix greybeards saying 'No!  Do
_not_ just blow your system away and reinstall just because something
went wrong and you don't understand it.  If you do that, how will you
ever learn anything?  And, it's probably a simple fix.'  In this case,
if it _is_ a simple fix, I'm not seeing it, and it's your judgement call
how much more time you want to spend collecting information before just
bagging it and making the problem go away.

If you leave the question as 'I'll leave this system broken because
people are posting to this thread, and keep carrying out their
suggestions', OK, far be it from me to criticise.  It's _your_ broken
system; you get to decide how long you keep it around, broken, just to
collect other people's speculations and suggestions.  At some point, you
might decide having an unbroken system is more important.
