[conspire] linuxmafia.com - host had kernel booboo (Oops); & (I manually) rebooted.

Michael Paoli michael.paoli at cal.berkeley.edu
Thu Nov 30 10:53:12 PST 2023


Rick, (& Cc: et al.),

linuxmafia.com - host had kernel booboo (Oops); & (I manually) rebooted.

So, it had another booboo.  This time I also happened to have (and
still have) a screen session on guido (the physical host upon which
the linuxmafia(.com) VM guest resides) to capture console (most
notably output) from linuxmafia(.com) ... so caught that.
(Hey, didn't have that on ye olde physical hardware ... unless someone
wanted to dust off the screen and take a photo with their phone ...
and probably couldn't scroll back to see what was on that literal
physical screen earlier, either.)
I also had, a level above that, my own screen session on my terminal
... and happened to be close enough that my ears caught the terminal
bell (yes, I have screen do that, rather than just visual) ... so ...
took a peek ...: another kernel Oops, and I rebooted it
(virsh destroy linuxmafia (about the virtual equivalent of yanking the
power cord out from under the VM);
virsh start linuxmafia --console) ...
and likewise (at least the restart bit) again under screen on guido -
so again, anything that linuxmafia dumps to console, so long as guido
is up and alive and healthy, should also be seen there (and if it
doesn't roll out of buffer, continue to be available there).  Anyway,
I've also noted what I captured in the relevant log file,
linuxmafia:~root/Changelog.

Anyway, some more detail, from linuxmafia:~root/Changelog:
And from physical host, then did:
virsh destroy linuxmafia
virsh start linuxmafia --console
And then we have ...:
linuxmafia:~# hostname; uptime; who am I; id
linuxmafia.com
 09:35:59 up 3 min,  1 user,  load average: 0.03, 0.12, 0.06
mpaoli   ttyS0        Nov 30 09:34
uid=0(root) gid=0(root) groups=0(root),102(Debian-exim)
linuxmafia:~#

And, bit more regarding the screen capture, as notably set up earlier:
From: Michael Paoli <michael.paoli at cal.berkeley.edu>
Date: Thu, Nov 23, 2023 at 2:02 AM
Subject: Re: guido ... linuxmafia.com ...
To: Rick Moen <rick at linuxmafia.com>

And ... this might be a bit handier for either of us to find (if we
remember it's there):
# hostname; id; screen -ls
guido
uid=0(root) gid=0(root) groups=0(root)
There is a screen on:
        7637.for_CONSOLE_of_linuxmafia.com      (11/23/23 09:33:52)     (Detached)
1 Socket in /run/screen/S-root.
#
# screen -rx -S for_CONSOLE_of_linuxmafia.com
...
root@linuxmafia:~# hostname; id; tty
linuxmafia.com
uid=0(root) gid=0(root) groups=0(root),102(Debian-exim)
/dev/ttyS0
root@linuxmafia:~# echo -e 'console\r' >> /dev/console
console
root@linuxmafia:~#

This might also be easier to find/review ... e.g. if linuxmafia VM has
a kernel booboo (for whatever reason(s))
yet guido is still up and running and (at least mostly) okay.
And yes on Tuesday evening ... linuxmafia.com had a kernel Oops ...
turns out I had a screen session that captured that:

And also in that same earlier email:
> > > > > > > Might want to set up a "watchdog" or the like, so if/when it crashes
> > > > > > > or get wedged, it would then get automagically rebooted.
> > > > > > >
> > > > > > > Was thinking about that after the last such event ... might be time to
> > > > > > > nudge that up on the priorities.  Once upon a time did likewise on the
> > > > > > > VM where BALUG was ... turned out a bad bot was overwhelming the web
> > > > > > > server, that sucked up too much resource, and the host would wedge.
> > > > > > > Linux supports a software watchdog timer - that was useful for that
> > > > > > > ... but that might not suffice in case of kernel oops ... however,
> > > > > > > since it's virtual, physical host could potentially detect
> > > > > > > unresponsive VM guest, and force a reboot.  Anyway, BALUG ... way back
> > > > > > > then ... after isolating the issue, some tuning to the web server ...
> > > > > > > and that took care of the issue (don't let bad bot consume excess
> > > > > > > resources).

So, I'd also been thinking ... along those principles, set up
monitoring from guido (physical host), and if linuxmafia(.com) goes
(too) unresponsive, then just forcibly restart that VM.  Crude, but it
would make for significantly faster - and automagic - recovery.
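
A rough sketch of what such a host-side check might look like, in
Python, run on guido.  The virsh destroy / virsh start commands are the
same ones used manually above (minus --console, since nothing
interactive is attached).  Everything else is an assumption for
illustration: probing with a single ping, the thresholds, and all of
the function names are hypothetical, and a ping failure can of course
also mean a network problem rather than a dead guest.

```python
import subprocess
import time

DOMAIN = "linuxmafia"          # libvirt domain name, as in the virsh commands above
GUEST_ADDR = "linuxmafia.com"  # assumption: an address guido can probe
MAX_FAILURES = 5               # hypothetical: consecutive failed probes before acting
PROBE_INTERVAL_S = 60          # hypothetical: seconds between probes


def guest_responds(addr, timeout_s=5):
    """Return True if a single ICMP echo to addr is answered (Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), addr],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def restart_commands(domain):
    """The same forced stop and start that was done by hand."""
    return [["virsh", "destroy", domain], ["virsh", "start", domain]]


def force_restart(domain, run=subprocess.run):
    """Run the restart commands; a failing destroy (domain already off) is ignored."""
    for cmd in restart_commands(domain):
        run(cmd, check=False)


def watch(domain, probe, restart, max_failures, interval_s,
          sleep=time.sleep, max_probes=None):
    """Probe repeatedly; restart after max_failures consecutive failed probes.

    Returns the number of restarts performed (only reachable when
    max_probes is set, which is mainly useful for dry runs).
    """
    failures = 0
    restarts = 0
    probes_done = 0
    while max_probes is None or probes_done < max_probes:
        probes_done += 1
        if probe():
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                restart(domain)
                restarts += 1
                failures = 0
        sleep(interval_s)
    return restarts

# Possible invocation on the physical host (runs until interrupted):
#   watch(DOMAIN, lambda: guest_responds(GUEST_ADDR), force_restart,
#         MAX_FAILURES, PROBE_INTERVAL_S)
```

Requiring several consecutive failures, rather than acting on the first
one, is just to avoid yanking the virtual power cord over a single
dropped packet.
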
Only thing I was also thinking, though ... that could potentially
backfire or cause issues, e.g. when one intentionally takes linuxmafia
down to do some appropriate maintenance ... having the physical host
countermand that could be problematic ... potentially even quite so.
However, I was thinking further on that ... could work around that,
but it would take a wee bit more work.  E.g. relatively out-of-band
communications between guest (VM) and host (physical).
E.g., at least in notable part, on a normal regular shutdown the guest
could mark on some shared storage (512 bytes of shared storage would
more than suffice) that it was intentionally shut down (or rebooted)
from the guest, and the host could consult the data there.  Likewise
the guest could update it when booted - so essentially the data there
would reflect, from guest to host, the desired target state.  That
plus some other safeguards and such (e.g. sufficient delay to allow
the guest to complete a regular boot, etc.) could possibly make a
sufficiently good "solution" (/work-around).
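
To make that a bit more concrete, here's a minimal sketch in Python of
one way the 512-byte record and the host's decision could fit together.
The record layout, the state values, the grace period, and every name
here are made up for illustration; how the 512 bytes actually get
shared between guest and host is left open, and guest and host clocks
are assumed to be roughly in agreement.

```python
import struct

RECORD_SIZE = 512            # one 512-byte block, as suggested above
MAGIC = b"VMST"              # hypothetical marker so garbage isn't trusted
_FMT = "<4sBd"               # magic, state byte, Unix timestamp (13 bytes)

STATE_RUNNING = 1            # guest wrote this after booting
STATE_DOWN_INTENTIONAL = 2   # guest wrote this during a regular shutdown/reboot


def encode_record(state, timestamp):
    """Guest side: build the 512-byte block holding the desired target state."""
    body = struct.pack(_FMT, MAGIC, state, timestamp)
    return body.ljust(RECORD_SIZE, b"\0")


def decode_record(raw):
    """Host side: return (state, timestamp), or None if the block isn't valid."""
    if len(raw) < struct.calcsize(_FMT):
        return None
    magic, state, timestamp = struct.unpack_from(_FMT, raw)
    if magic != MAGIC:
        return None
    return state, timestamp


def should_force_restart(record, guest_responsive, now, last_restart,
                         boot_grace_s=300):
    """Host side: decide whether to forcibly restart the guest.

    Only restart if the guest is unresponsive, it last said it wanted to
    be running, and enough time has passed (both since the guest's own
    boot marker and since the host's last forced restart) for a regular
    boot to have completed.
    """
    if guest_responsive:
        return False
    if record is None:
        return False         # unknown target state: leave it to a human
    state, marked_at = record
    if state != STATE_RUNNING:
        return False         # taken down on purpose; don't countermand
    if now - last_restart < boot_grace_s:
        return False         # the host itself restarted it recently
    return now - marked_at >= boot_grace_s
```

The choice to do nothing when the record is missing or invalid is
deliberately the cautious one; the opposite default would be just as
easy to write.
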

And ... won't do the whole kernel Oops here, but from the most recent,
for folks wondering more precisely when it went down, we have (and
that VM itself uses local time):
root@linuxmafia:~# [658706.477889] BUG: unable to handle kernel NULL pointer dereference at 000000a4
[658706.480408] IP: [<e08861b0>] e1000_clean+0x8d/0x40d [e1000]
[658706.480408] *pde = 00000000
[658706.480408] Oops: 0000 [#1] SMP
...
Message from syslogd@linuxmafia at Thu Nov 30 09:21:37 2023 ...
...
So ... was down a total of about 12ish minutes.
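
For anyone wanting to check that figure, a small worked calculation in
Python from the timestamps quoted in this message.  It assumes the
syslogd message time is close to when the Oops happened, and that
"up 3 min" in the uptime output means somewhere from 3 up to (not
including) 4 minutes; the variable names are just for illustration.

```python
from datetime import datetime, timedelta

oops_logged = datetime(2023, 11, 30, 9, 21, 37)     # syslogd message time
uptime_sampled = datetime(2023, 11, 30, 9, 35, 59)  # clock shown by uptime
reported_up = timedelta(minutes=3)                  # "up 3 min"
console_login = datetime(2023, 11, 30, 9, 34)       # "who am I" (minute resolution)

# Assumption: the reported uptime is truncated to whole minutes,
# so the real uptime was in the range [3 min, 4 min).
boot_latest = uptime_sampled - reported_up
boot_earliest = boot_latest - timedelta(minutes=1)

print("Oops to boot start, at least:", boot_earliest - oops_logged)
print("Oops to boot start, at most: ", boot_latest - oops_logged)
print("Oops to console login minute:", console_login - oops_logged)
```

So roughly 10 to 11 and a bit minutes until the VM started booting
again, and 12ish by the time of the console login.
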


