[conspire] systemd 8-O ; -) ... bind9 chroot Debian 9 (stretch) --> Debian 10 (buster)

Michael Paoli Michael.Paoli at cal.berkeley.edu
Sat Apr 18 04:03:26 PDT 2020


So, ... hitting a systemd issue I'd like to figure out and get resolved.
Yeah, I know, systemd, ugh ... but despite my also not much liking it,
if reasonably feasible, want to see if I can get this issue resolved.
So, bit o' background:

So, ... working on (near) clone (balugclone) of system (balug).
Near?  As in starting about identical, then mostly changing "just
enough" (
clone:
     different Ethernet MAC address
     (before even first booting) down interface link:
     (
     link=down; mac=52:54:00:67:20:40
     virsh domif-setlink balugclone "$mac" "$link" --config
     virsh domif-setlink balugclone "$mac" "$link"
     virsh domif-getlink balugclone "$mac" --config
     virsh domif-getlink balugclone "$mac"
     )
     change network from bridged to default (RFC-1918 + NAT/SNAT)
     stop and disable potential conflicting services:
     systemctl stop & systemctl disable:
     mailman.service
     exim4.service
     apache2.service
     spamassassin.service
     rsync.service
     mariadb.service
     bind9.service
     ...
)
to avoid conflicts with the running production balug
Virtual Machine (VM) and its data, etc.
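
The stop-and-disable step above can be sketched as a small loop.  This is
only a sketch: the list covers just the services named above (not the
"..." ones), and SYSTEMCTL is a hypothetical override hook, not a systemd
feature, there so the loop can be dry-run with SYSTEMCTL=echo.

```shell
# Services named in the list above; the "..." entries are not filled in here.
services="mailman exim4 apache2 spamassassin rsync mariadb bind9"

# Stop, then disable, each service in turn.
# SYSTEMCTL is a hypothetical hook for dry runs (e.g. SYSTEMCTL=echo).
quiesce_services() {
    for svc in $services; do
        "${SYSTEMCTL:-systemctl}" stop "$svc.service" &&
        "${SYSTEMCTL:-systemctl}" disable "$svc.service"
    done
}

# Dry run:   SYSTEMCTL=echo; quiesce_services
# For real:  unset SYSTEMCTL; quiesce_services   (as root, on the clone only)
```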
And, what for?  Do a pre-production Debian 9 (stretch) --> 10 (buster)
upgrade, to be able to plan for and have (theoretically) smooth actual
production upgrade.  Alas, last time around, wasn't quite thorough
enough:
https://lists.balug.org/pipermail/balug-admin/2020-February/001018.html

Anyway, this time, fair bit more progress (yea!) (notably working
through sanity checks of at least basic functionality of important services).

But alas, still bumping into one gotcha I've not yet found a fix for.
And, yup, systemd specific.

So, nameserver - running BIND9 under chroot.
If I fire it up manually, in the manner that sysvinit would, were it present:
# PATH=/sbin:/bin:/usr/sbin:/usr/bin start-stop-daemon --start --oknodo \
   --quiet --exec /usr/sbin/named --pidfile /run/named/named.pid -- \
   -u bind -t /var/lib/named
Then all appears fine, it runs fine, functions, keeps working, etc.
(note to safely test it on clone, also:
clone:
     /etc/network/interfaces disable interfaces except lo and change eth0
         to inet dhcp
     (eth0 & relevant configs later becomes ens3 through the upgrade)
     shutdown
     up interface link:
     (link=up; mac=52:54:00:67:20:40
     virsh domif-setlink balugclone "$mac" "$link" --config
     virsh domif-setlink balugclone "$mac" "$link"
     virsh domif-getlink balugclone "$mac" --config
     virsh domif-getlink balugclone "$mac"
     )
     boot
     and before enabling and attempting to (re)start bind9:
     bind9 all notify off (no)
     comment out notify-source and notify-source-v6
)

But alas, when started under systemd with:
# systemctl start bind9.service
Things go kind'a funky ... and fail in fairly short order.
First of all, as far as I can tell, from both systemd config,
and also looking at process arguments and such, looks like bind9
fires up properly under chroot in either case.
From: /etc/systemd/system/bind9.service.d/bind9.conf
we have:
ExecStart=/usr/sbin/named -f -u bind -t /var/lib/named

Also, without that -f option there (and after:
# systemctl daemon-reload
)
it then effectively doesn't (as far as systemd/systemctl is concerned)
work at all, failing quite immediately with:
systemd[1]: bind9.service: Control process exited, code=exited, status=1/FAILURE
... even though bind9/named is and continues to run fine in that case ...
but the systemd/systemctl status is all wrong, as it thinks it failed,
so, need the -f option.  Anyway, back to with -f (foreground) option:

And ... smoking gun ... strace(1).
It looks like in both cases (manual sysvinit-like start, or systemd:
# systemctl start bind9.service
) named itself starts and
runs fine ... it's actually a systemd (configuration?) problem!
And, how did I find that?  When the named process fails, it fails
because it's getting SIGTERM!!!:
4539  --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
This seems to consistently happen about 90 seconds after systemd/systemctl
"starts" (attempts to start) it.
And ...:
4689  kill(4690, SIGTERM)               = 0
(the only reason the two PIDs between that and the earlier above don't
match, is they were captured in separate runs).
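
The si_pid=1 in that siginfo line is what points at PID 1, i.e. systemd,
as the sender.  A minimal sketch for pulling the sender PID(s) out of saved
strace output, assuming the "--- SIGTERM {...} ---" line format shown above
(the function name is made up for illustration):

```shell
# Print the sending PID (si_pid) for every SIGTERM delivery line found in
# strace output.  Reads the named file(s), or stdin if none are given.
sigterm_senders() {
    sed -n 's/.*--- SIGTERM {.*si_pid=\([0-9][0-9]*\).*/\1/p' "$@"
}

# Example, with the strace output saved in a (hypothetical) file:
#   sigterm_senders named.strace | sort | uniq -c
```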
It's systemd/systemctl that's sending the signal that's causing
bind9 (named) to shut down - that's also 100% consistent with what the
logs show, e.g. (shortening the timestamps to MM:SS):
51:42 balug-sf-lug-v2 named[5518]: resolver priming query complete
53:12 balug-sf-lug-v2 named[5518]: shutting down
53:12 balug-sf-lug-v2 named[5518]: stopping command channel on 127.0.0.1#953
53:12 balug-sf-lug-v2 named[5518]: stopping command channel on ::1#953
53:12 balug-sf-lug-v2 named[5518]: no longer listening on ::#53
53:12 balug-sf-lug-v2 named[5518]: no longer listening on 127.0.0.1#53
53:12 balug-sf-lug-v2 named[5518]: no longer listening on 192.168.122.245#53
53:12 balug-sf-lug-v2 named[5518]: exiting
So ... at this point I'm trying to figure out why systemd/systemctl
is SIGTERMing named - when it ought not.  I'm guesstimating maybe
it tries to do some "health check", and does it improperly, and after
90 seconds "gives up" and SIGTERMs the PID.
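
On the 90 seconds: from the log above, 51:42 (priming query complete) to
53:12 (shutting down) is exactly that.  A throwaway sketch of the
arithmetic, for MM:SS stamps within the same hour (helper names are
hypothetical):

```shell
# Convert an MM:SS stamp to seconds.  A single leading zero is stripped so
# that values like 08 or 09 are not misread as (invalid) octal.
mmss_to_sec() {
    m=${1%%:*}; s=${1##*:}
    m=${m#0}; s=${s#0}
    echo $(( ${m:-0} * 60 + ${s:-0} ))
}

# Seconds elapsed from the first stamp to the second (same hour assumed).
mmss_diff() {
    echo $(( $(mmss_to_sec "$2") - $(mmss_to_sec "$1") ))
}

mmss_diff 51:42 53:12    # prints 90
```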
I also notice:
# systemctl start bind9.service
... if done from terminal, that remains in the foreground the entire time.
So seems systemd/systemctl is "waiting" for some check to pass before
"releasing", and instead times out waiting, gives up, and zaps the PID.

So, curious if any folks might know or have more clue(s) as to what
to look at and/or where to get down to the bottom of this
systemd/systemctl issue with bind9/named (also not seeing this issue
with any of the other services).
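
One avenue that may be worth a look (an assumption, not something the
traces above confirm): 90 seconds matches systemd's stock default start
timeout (DefaultTimeoutStartSec=90s), which would fit a unit whose Type=
expects a fork or a readiness notification that a foreground (-f) named
never provides.  A sketch for dumping the relevant properties of the unit
as systemd actually sees it, drop-ins included (SYSTEMCTL is again a
hypothetical dry-run hook, and the function name is made up):

```shell
# Show the unit properties most likely to bear on a start timeout.
# SYSTEMCTL is a hypothetical hook for dry runs, not a systemd feature.
show_start_props() {
    "${SYSTEMCTL:-systemctl}" show "$1" \
        -p Type -p PIDFile -p TimeoutStartUSec -p NotifyAccess -p ExecStart
}

# Example:  show_start_props bind9.service
# See also: systemctl cat bind9.service   (unit file plus any drop-ins)
```

If Type there turns out to be something other than simple, that would at
least be a candidate explanation to test.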


Other interesting bit ... (maybe just distraction / red herring):
/bin/systemd-tty-ask-password-agent
systemd/systemctl, done with interactive start from terminal,
fires up (forks (clone) and execs) /bin/systemd-tty-ask-password-agent
with argument of --wait.  If I redirect stdin from /dev/null,
e.g.:
# </dev/null systemctl start bind9.service
I don't end up with the /bin/systemd-tty-ask-password-agent process
hanging out for the duration ... but even in that case, named still
gets SIGTERMed by systemd/systemctl right around 90 seconds after it's
been fired up.
Also, on details, systemd/systemctl sends SIGCONT immediately
before the SIGTERM ... but it's the SIGTERM that has everything going
sideways and TERMinates the running bind9/named.

Also, if folks are curious, here are some of the key bits
that allow bind9/named to function under chroot:
$ grep named.\*bind /etc/fstab
/dev/null /var/lib/named/dev/null none bind 0 0
/dev/random /var/lib/named/dev/random none bind 0 0
/run/named /var/lib/named/run/named none bind 0 0
/usr/share/dns /var/lib/named/usr/share/dns none bind 0 0
$
That, and some symlink(s), etc., and it works under chroot ...
and stuff that needs to and ought to interact with it, from outside of
chroot, all works and plays nice together (almost the same as
Debian 9 (stretch) ... just one more directory from /usr for
Debian 10 (buster)).  And with that infrastructure, it probably also
runs just fine outside of chroot too, without any changes ... but I
really don't want to be running it outside of the chroot.
Ah, what the heck, it's non-production, let's try ...
/etc/systemd/system/bind9.service.d/bind9.conf
ExecStart=/usr/sbin/named -f -u bind
# systemctl daemon-reload
# systemctl start bind9.service
... and still fails same way (again shortening the timestamps to MM:SS):
11:19 balug-sf-lug-v2 named[5991]: resolver priming query complete
12:49 balug-sf-lug-v2 named[5991]: shutting down
12:49 balug-sf-lug-v2 named[5991]: stopping command channel on 127.0.0.1#953
12:49 balug-sf-lug-v2 named[5991]: stopping command channel on ::1#953
12:49 balug-sf-lug-v2 named[5991]: no longer listening on ::#53
12:49 balug-sf-lug-v2 named[5991]: no longer listening on 127.0.0.1#53
12:49 balug-sf-lug-v2 named[5991]: no longer listening on 192.168.122.245#53
12:49 balug-sf-lug-v2 named[5991]: exiting
And if I do it sysvinit-like start, without chroot:
# PATH=/sbin:/bin:/usr/sbin:/usr/bin start-stop-daemon --start --oknodo \
   --quiet --exec /usr/sbin/named --pidfile /run/named/named.pid -- \
   -u bind
... it continues to stay up and running no problem, long past 90 seconds,
so it appears the chroot isn't the issue at all: the failure happens
under systemd with or without it.
FYI:
$ ls -l /etc/bind
lrwxrwxrwx 1 root root 25 Mar 15  2014 /etc/bind -> ../var/lib/named/etc/bind
$
Anyway, mostly that, and the bind mounts, and appropriate
permissions/ownerships, and it plays well in and/or out of chroot (alas,
probably the first time I fired it up outside of chroot in many years).
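
For anyone wanting to sanity check the bind mounts, a minimal sketch that
lists the bind-mount targets under a prefix from fstab-format input, so
each can then be checked as actually mounted (the function name is
hypothetical; the findmnt check in the comment assumes util-linux's
findmnt is installed):

```shell
# Print the mount point (field 2) of every non-comment fstab entry whose
# options (field 4) include "bind" and whose mount point starts with the
# given prefix.  Reads the named file(s), or stdin if none are given.
bind_targets() {
    prefix=$1; shift
    awk -v p="$prefix" '
        $1 !~ /^#/ && $4 ~ /(^|,)bind(,|$)/ && index($2, p) == 1 { print $2 }
    ' "$@"
}

# Example:
#   bind_targets /var/lib/named /etc/fstab |
#   while read -r t; do
#       findmnt -M "$t" >/dev/null || echo "not mounted: $t"
#   done
```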



