[conspire] solved: Re: systemd 8-O ;-) ... bind9 chroot Debian 9 (stretch) --> Debian 10 (buster)

Michael Paoli Michael.Paoli at cal.berkeley.edu
Sat Apr 18 17:46:51 PDT 2020


Okay, got it solved.

Turned out to be relatively simple ... but I missed it on earlier passes.
The fix:
# rm /etc/systemd/system/bind9.service.d/bind9.conf
# systemctl daemon-reload

Bit o' background:
the earlier:
/etc/systemd/system/bind9.service.d/bind9.conf
was put in place - and necessary - to do certain overrides
for systemd launching bind9 ... notably for the chroot.
However, now the unit file for bind9
/lib/systemd/system/bind9.service
had changed how it was
calling / expecting to launch bind9.  Notably:
[Service]
Type=forking
So, that was conflicting with how
/etc/systemd/system/bind9.service.d/bind9.conf
was firing up bind9's named.
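
For illustration, a rough sketch of how one might spot that kind of mismatch.
With Type=forking, systemd waits for the start command to fork and exit, so
an ExecStart that keeps named in the foreground (-f) never looks "started",
and the start job eventually times out. That would fit the ~90 seconds in the
quoted message below, which is systemd's default start timeout (assuming it
wasn't changed). The name check_forking_conflict is made up for this example,
and the parsing is a simple heuristic over the files given, not how systemd
itself merges unit settings.

```shell
#!/bin/sh
# Heuristic sketch: given a unit file followed by its drop-in(s), in the order
# systemd would apply them, report whether the last Type= is "forking" while
# the last non-empty ExecStart= passes -f (foreground) to the daemon.
# check_forking_conflict is a hypothetical helper name, not a systemd tool.
check_forking_conflict() {
    awk '
        /^Type=/      { type = substr($0, 6) }
        /^ExecStart=/ { cmd = substr($0, 11); if (cmd != "") exec_start = cmd }
        END {
            n = split(exec_start, words, " ")
            foreground = 0
            for (i = 1; i <= n; i++) if (words[i] == "-f") foreground = 1
            if (type == "forking" && foreground) { print "conflict"; exit 1 }
            print "ok"
        }
    ' "$@"
}
```

E.g.:
# check_forking_conflict /lib/systemd/system/bind9.service \
  /etc/systemd/system/bind9.service.d/bind9.conf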
Also, the newer
/lib/systemd/system/bind9.service
picks up the configuration bits in
/etc/default/bind9
(looks like the older did too ... but probably
the much older did not)
which also has the customizations for chroot,
so the (custom)
/etc/systemd/system/bind9.service.d/bind9.conf
was no longer needed at all.
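
For reference, the chroot bit in /etc/default/bind9 would look roughly like
the following. This is a sketch, not a copy of the actual file: the OPTIONS
string is assumed from the -u bind -t /var/lib/named arguments shown in the
quoted message below.

```shell
# /etc/default/bind9 (sketch; contents assumed, not copied from the real file)
# Options handed to named by the packaged unit: run as user bind,
# chroot to /var/lib/named.
OPTIONS="-u bind -t /var/lib/named"
```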
Still not sure why it didn't work earlier
when I removed the -f option in
/etc/systemd/system/bind9.service.d/bind9.conf
and I think I even tried removing that file entirely,
but I might've missed the
# systemctl daemon-reload
step earlier on some of those attempts.
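
One way to check whether a daemon-reload got missed, sketched here on the
assumption that the installed systemd reports the NeedDaemonReload unit
property via systemctl show. The helper name needs_daemon_reload and the
SYSTEMCTL override variable are just made up for this example.

```shell
#!/bin/sh
# Sketch: succeed (exit 0) if systemd reports that the unit's files changed on
# disk since they were last loaded, i.e. a "systemctl daemon-reload" is pending.
# SYSTEMCTL is a hypothetical override so the function can be tried out
# against a stub; by default it runs the real systemctl.
needs_daemon_reload() {
    ${SYSTEMCTL:-systemctl} show -p NeedDaemonReload "$1" |
        grep -qx 'NeedDaemonReload=yes'
}
```

E.g.:
# needs_daemon_reload bind9.service && systemctl daemon-reload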
In any case, cleanest "fix" for it:
# rm /etc/systemd/system/bind9.service.d/bind9.conf
# systemctl daemon-reload
(well ... notwithstanding totally gutting systemd ;-))
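
To double check that no other overrides are left lying around for the unit,
something like this sketch could be used. It only looks in the
/etc/systemd/system/<unit>.d directory (other drop-in locations such as
/run/systemd/system are not covered), and list_dropins plus its optional
root argument are names invented here. On a live system,
systemctl cat bind9.service shows the unit along with any drop-ins systemd
actually applies.

```shell
#!/bin/sh
# Sketch: print the drop-in .conf files found for a unit under
# ROOT/etc/systemd/system/UNIT.d (ROOT defaults to empty, i.e. the live system).
# list_dropins is a hypothetical helper, not part of systemd.
list_dropins() {
    unit=$1
    root=${2:-}
    for f in "$root/etc/systemd/system/$unit.d"/*.conf; do
        # Skip the unexpanded glob when the directory is missing or empty.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

E.g.:
# list_dropins bind9.service
(no output means nothing left under /etc/systemd/system/bind9.service.d)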
https://wiki.debian.org/Bind9
is also pretty good, but it could do with some more updating (which
I'll likely get around to if someone doesn't beat me to it).

> From: "Michael Paoli" <Michael.Paoli at cal.berkeley.edu>
> Subject: systemd 8-O ;-) ... bind9 chroot Debian 9 (stretch) --> Debian 10 (buster)
> Date: Sat, 18 Apr 2020 04:03:26 -0700

> So, ... hitting a systemd issue I'd like to figure out and get resolved.
> Yeah, I know, systemd, ugh ... but despite my also not much liking it,
> if reasonably feasible, want to see if I can get this issue resolved.
> So, bit o' background:
>
> So, ... working on (near) clone (balugclone) of system (balug).
> Near?  As in starting about identical, then mostly changing "just
> enough" (
> clone:
>     different Ethernet MAC address
>     (before even first booting) down interface link:
>     (
>     link=down; mac=52:54:00:67:20:40
>     virsh domif-setlink balugclone "$mac" "$link" --config
>     virsh domif-setlink balugclone "$mac" "$link"
>     virsh domif-getlink balugclone "$mac" --config
>     virsh domif-getlink balugclone "$mac"
>     )
>     change network from bridged to default (RFC-1918 + NAT/SNAT)
>     stop and disable potential conflicting services:
>     systemctl stop & systemctl disable:
>     mailman.service
>     exim4.service
>     apache2.service
>     spamassassin.service
>     rsync.service
>     mariadb.service
>     bind9.service
>     ...
> )
> to avoid conflicts with the running production balug
> Virtual Machine (VM) and its data, etc.
> And, what for?  Do a pre-production Debian 9 (stretch) --> 10 (buster)
> upgrade, to be able to plan for and have (theoretically) smooth actual
> production upgrade.  Alas, last time around, wasn't quite thorough
> enough:
> https://lists.balug.org/pipermail/balug-admin/2020-February/001018.html
>
> Anyway, this time, fair bit more progress (yea!) (notably working
> through sanity checks of at least basic functionality of important services).
>
> But alas, still bumping into one gotcha I've not yet found a fix for.
> And, yup, systemd specific.
>
> So, nameserver - running BIND9 under chroot.
> If I fire it up manually, in manner that sysvinit would were it present:
> # PATH=/sbin:/bin:/usr/sbin:/usr/bin start-stop-daemon --start --oknodo \
>   --quiet --exec /usr/sbin/named --pidfile /run/named/named.pid -- \
>   -u bind -t /var/lib/named
> Then all appears fine, it runs fine, functions, keeps working, etc.
> (note to safely test it on clone, also:
> clone:
>     /etc/network/interfaces disable interfaces except lo and change eth0
>         to inet dhcp
>     (eth0 & relevant configs later becomes ens3 through the upgrade)
>     shutdown
>     up interface link:
>     (link=up; mac=52:54:00:67:20:40
>     virsh domif-setlink balugclone "$mac" "$link" --config
>     virsh domif-setlink balugclone "$mac" "$link"
>     virsh domif-getlink balugclone "$mac" --config
>     virsh domif-getlink balugclone "$mac"
>     )
>     boot
>     and before enabling and attempting to (re)start bind9:
>     bind9 all notify off (no)
>     comment out notify-source and notify-source-v6
> )
>
> But alas, when started under systemd with:
> # systemctl start bind9.service
> Things go kind'a funky ... and fail in fairly short order.
> First of all, as far as I can tell, from both systemd config,
> and also looking at process arguments and such, looks like bind9
> fires up properly under chroot in either case.
> From: /etc/systemd/system/bind9.service.d/bind9.conf
> we have:
> ExecStart=/usr/sbin/named -f -u bind -t /var/lib/named
>
> Also, without that -f option there (and after:
> # systemctl daemon-reload
> )
> it then effectively doesn't (as far as systemd/systemctl is concerned)
> work at all, failing quite immediately with:
> systemd[1]: bind9.service: Control process exited, code=exited, status=1/FAILURE
> ... even though bind9/named is and continues to run fine in that case ...
> but the systemd/systemctl status is all wrong, as it thinks it failed,
> so, need the -f option.  Anyway, back to with -f (foreground) option:
>
> And ... smoking gun ... strace(1).
> It looks like in both cases (manual sysvinit-like start, or
> systemd:
> # systemctl start bind9.service
> ) named itself starts and
> runs fine ... it's actually a systemd (configuration?) problem!
> And, how did I find that?  When the named process fails, it fails
> because it's getting SIGTERM!!!:
> 4539  --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
> This seems to consistently happen about 90 seconds after systemd/systemctl
> "starts" (attempts to start) it.
> And ...:
> 4689  kill(4690, SIGTERM)               = 0
> (the only reason the two PIDs between that and the earlier above don't
> match, is they were captured in separate runs).
> It's systemd/systemctl that's sending the signal that's causing
> bind9 (named) to shut down - that's also 100% consistent with what the
> logs show, e.g. (shortening the timestamps to MM:SS):
> 51:42 balug-sf-lug-v2 named[5518]: resolver priming query complete
> 53:12 balug-sf-lug-v2 named[5518]: shutting down
> 53:12 balug-sf-lug-v2 named[5518]: stopping command channel on 127.0.0.1#953
> 53:12 balug-sf-lug-v2 named[5518]: stopping command channel on ::1#953
> 53:12 balug-sf-lug-v2 named[5518]: no longer listening on ::#53
> 53:12 balug-sf-lug-v2 named[5518]: no longer listening on 127.0.0.1#53
> 53:12 balug-sf-lug-v2 named[5518]: no longer listening on 192.168.122.245#53
> 53:12 balug-sf-lug-v2 named[5518]: exiting
> So ... at this point I'm trying to figure out why systemd/systemctl
> is SIGTERMing named - when it ought not.  I'm guesstimating maybe
> it tries to do some "health check", and does it improperly, and after
> 90 seconds "gives up" and SIGTERMs the PID.
> I also notice:
> # systemctl start bind9.service
> ... if done from terminal, that remains in the foreground the entire time.
> So seems systemd/systemctl is "waiting" for some check to pass before
> "releasing", and instead times out waiting, gives up, and zaps the PID.
>
> So, curious if any folks might know or have more clue(s) as to what
> to look at and/or where to get down to the bottom of this
> systemd/systemctl issue with bind9/named (also not seeing this issue
> with any of the other services).
>
>
> Other interesting bit ... (maybe just distraction / red herring):
> /bin/systemd-tty-ask-password-agent
> systemd/systemctl, done with interactive start from terminal,
> fires up (forks (clone) and execs /bin/systemd-tty-ask-password-agent
> with argument of --wait).  If I redirect stdin from /dev/null,
> e.g.:
> # </dev/null systemctl start bind9.service
> I don't end up with the /bin/systemd-tty-ask-password-agent process
> hanging out for the duration ... but even in that case, named still
> gets SIGTERMed by systemd/systemctl right around 90 seconds after it's
> been fired up.
> Also, on details, systemd/systemctl sends SIGCONT immediately
> before the SIGTERM ... but it's the SIGTERM that has everything going
> sideways and TERMinates the running bind9/named.
>
> Also, if folks are curious, here are some of the key bits
> that allow bind9/named to function under chroot:
> $ grep named.\*bind /etc/fstab
> /dev/null /var/lib/named/dev/null none bind 0 0
> /dev/random /var/lib/named/dev/random none bind 0 0
> /run/named /var/lib/named/run/named none bind 0 0
> /usr/share/dns /var/lib/named/usr/share/dns none bind 0 0
> $
> That, and some symlink(s), etc., and it works under chroot ...
> and stuff that needs to and ought to interact with it, from outside of
> chroot, all works and plays nice together (almost the same as
> Debian 9 (stretch) ... just one more directory from /usr for
> Debian 10 (buster)).  And with that infrastructure, it probably also
> runs just fine outside of chroot too, without any changes ... but I
> really don't want to be running it outside of the chroot.
> Ah, what the heck, it's non-production, let's try ...
> /etc/systemd/system/bind9.service.d/bind9.conf
> ExecStart=/usr/sbin/named -f -u bind
> # systemctl daemon-reload
> # systemctl start bind9.service
> ... and still fails same way (again shortening the timestamps to MM:SS):
> 11:19 balug-sf-lug-v2 named[5991]: resolver priming query complete
> 12:49 balug-sf-lug-v2 named[5991]: shutting down
> 12:49 balug-sf-lug-v2 named[5991]: stopping command channel on 127.0.0.1#953
> 12:49 balug-sf-lug-v2 named[5991]: stopping command channel on ::1#953
> 12:49 balug-sf-lug-v2 named[5991]: no longer listening on ::#53
> 12:49 balug-sf-lug-v2 named[5991]: no longer listening on 127.0.0.1#53
> 12:49 balug-sf-lug-v2 named[5991]: no longer listening on 192.168.122.245#53
> 12:49 balug-sf-lug-v2 named[5991]: exiting
> And if I do it sysvinit-like start, without chroot:
> # PATH=/sbin:/bin:/usr/sbin:/usr/bin start-stop-daemon --start --oknodo \
>   --quiet --exec /usr/sbin/named --pidfile /run/named/named.pid -- \
>   -u bind
> ... it continues to stay up and running no problem, long past 90 seconds,
> so appears it's not only not a chroot issue, but not even at all specific
> to chroot.
> FYI:
> $ ls -l /etc/bind
> lrwxrwxrwx 1 root root 25 Mar 15  2014 /etc/bind -> ../var/lib/named/etc/bind
> $
> Anyway, mostly that, and the bind mounts, and appropriate
> permissions/ownerships, and it plays well in and/or out of chroot (alas,
> probably the first time I fired it up outside of chroot in many years).



