[conspire] linuxmafia.com: yesterday: another kernel Oops, today: added watchdog

Thu Dec 28 22:45:53 PST 2023

linuxmafia.com: So,:

yesterday: another kernel Oops:
Message from syslogd at linuxmafia at Wed Dec 27 09:46:26 2023 ...
linuxmafia kernel: [1182265.672051] Oops: 0000 [#1] SMP
[1182267.004666]  [<c11cbbc0>] ? sock_sendmsg+0x96/0xae
[1182267.014521]  [<c10437d6>] ? autoremove_wake_function+0x0/0x2d
[1182267.029253]  [<c11f3b94>] ? ip_route_output_flow+0x71/0x1ac
[1182267.038502]  [<c113a993>] ? copy_from_user+0x27/0x10e
[1182267.048242]  [<c11d2f4c>] ? verify_iovec+0x3e/0x6e
[1182267.057662]  [<c11cbd35>] ? sys_sendmsg+0x15d/0x1c5
[1182267.067682]  [<c11cc65c>] ? sys_connect+0x7b/0xb2
[1182267.094435]  [<c11cc66f>] ? sys_connect+0x8e/0xb2
[1182267.100119]  [<c10437d6>] ? autoremove_wake_function+0x0/0x2d
[1182267.114698]  [<c10b225a>] ? fsnotify_modify+0x5a/0x61
[1182267.131592]  [<c11ccf0f>] ? sys_socketcall+0x171/0x1aa
[1182267.142669]  [<c10b2f49>] ? sys_write+0x5b/0x63
[1182267.150459]  [<c10030fb>] ? sysenter_do_call+0x12/0x28

Rick noticed, was able to get to it, and had it rebooted by:
$ hostname && (date -d "now - "$(awk '{print $1;}' < /proc/uptime)" seconds")
linuxmafia.com
Wed Dec 27 19:53:52 PST 2023
$
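
For anyone wanting to reuse that, here is a rough sketch of the same
uptime arithmetic as a small shell function. The name boot_epoch is
made up for illustration, and the final call assumes GNU date (for
"-d @...") and a Linux /proc/uptime:

```shell
# boot_epoch NOW_EPOCH UPTIME_SECONDS
# Hypothetical helper: prints the boot time as whole epoch seconds,
# i.e. NOW_EPOCH minus UPTIME_SECONDS, with any fraction truncated.
boot_epoch() {
    awk -v now="$1" -v up="$2" 'BEGIN { printf "%d\n", now - up }'
}

# Live use, only where /proc/uptime is readable (Linux).
if [ -r /proc/uptime ]; then
    date -d "@$(boot_epoch "$(date +%s)" "$(awk '{print $1}' /proc/uptime)")"
fi
```

This should print about the same timestamp as the one-liner above,
give or take a second of rounding.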

today: (I) added watchdog:
2023-12-28
linuxmafia VM (this host), (virtual) hardware change, added:
<watchdog model='i6300esb'/>
That will give a (virtual) hardware watchdog device after the next
reboot (may require a full (virtual) power down of the VM to activate
the change).
installed watchdog version 5.12-1
Configured watchdog as follows, notably changing these from the defaults:

--- /etc/default/watchdog.default       2023-12-28 22:02:22.000000000 -0800
+++ /etc/default/watchdog       2023-12-28 22:11:09.000000000 -0800
@@ -4 +4 @@
-watchdog_module="none"
+watchdog_module=i6300esb
--- /etc/watchdog.conf.default  2012-04-05 03:16:33.000000000 -0700
+++ /etc/watchdog.conf  2023-12-28 22:10:12.000000000 -0800
@@ -23 +23 @@
-#watchdog-device       = /dev/watchdog
+watchdog-device        = /dev/watchdog
So, once a full (virtual) power down and reboot is done, the watchdog
should then be fully effective and should, e.g., generally recover
automagically from a kernel Oops (also tested on linuxmafia2 VM), with
the (virtual) hardware doing a (virtual) reset after 30s of non-response.
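
After that power cycle, a quick sanity check might look something like
the sketch below. The function name watchdog_ready is hypothetical, and
it assumes the i6300esb driver is built as a loadable module (so that it
appears in /proc/modules); if the driver is compiled into the kernel,
only the device check is meaningful:

```shell
# watchdog_ready DEVICE MODULE [MODULES_FILE]
# Hypothetical helper: succeeds only if DEVICE is a character device
# and MODULE is listed in MODULES_FILE (default: /proc/modules).
watchdog_ready() {
    dev=$1
    mod=$2
    modfile=${3:-/proc/modules}
    if [ ! -c "$dev" ]; then
        echo "no character device at $dev"
        return 1
    fi
    if ! grep -q "^$mod " "$modfile" 2>/dev/null; then
        echo "module $mod not listed in $modfile"
        return 1
    fi
    echo "watchdog device $dev present, module $mod loaded"
}

# Example call on the VM after the power cycle:
#   watchdog_ready /dev/watchdog i6300esb
```

This only reads state. It does not open /dev/watchdog, so it should not
interfere with the running watchdog daemon.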

So, not exactly a fix, but after the next (virtual) power down and
reboot, it should be a pretty effective work-around to reduce downtime
from kernel Oops within the VM itself, and it can likewise be configured
to handle other issues (e.g. load too high, or failure of specific
check(s)).
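
As a rough illustration of such a check (nothing like this is
configured on linuxmafia.com at present), the sketch below tests the
1-minute load average against a limit. The function name load_below and
the limit of 24 are assumptions chosen for the example. As I understand
watchdog.conf(5), a script like this could be hooked in via the
test-binary option, with a non-zero exit status treated as a failure,
and the simple load case is also covered by the built-in max-load-1
option:

```shell
# load_below LIMIT [LOADAVG_FILE]
# Hypothetical check: succeeds if the 1-minute load average (first
# field of LOADAVG_FILE, default /proc/loadavg) is below LIMIT.
# Fails if the file is empty or unreadable.
load_below() {
    limit=$1
    file=${2:-/proc/loadavg}
    awk -v limit="$limit" '
        NR == 1 { ok = ($1 < limit) }
        END     { exit !ok }
    ' "$file" 2>/dev/null
}

# As a standalone check script, the body could end with:
#   load_below 24 || exit 1
```

A custom script mostly earns its keep for checks the daemon does not
already provide, so for plain load limits the built-in option is
probably the simpler choice.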