[conspire] linuxmafia.com: yesterday: another kernel Oops, today: added watchdog
Michael Paoli
Michael.Paoli at cal.berkeley.edu
Thu Dec 28 22:45:53 PST 2023
linuxmafia.com: So:
yesterday: another kernel Oops:
Message from syslogd at linuxmafia at Wed Dec 27 09:46:26 2023 ...
linuxmafia kernel: [1182265.672051] Oops: 0000 [#1] SMP
[1182267.004666] [<c11cbbc0>] ? sock_sendmsg+0x96/0xae
[1182267.014521] [<c10437d6>] ? autoremove_wake_function+0x0/0x2d
[1182267.029253] [<c11f3b94>] ? ip_route_output_flow+0x71/0x1ac
[1182267.038502] [<c113a993>] ? copy_from_user+0x27/0x10e
[1182267.048242] [<c11d2f4c>] ? verify_iovec+0x3e/0x6e
[1182267.057662] [<c11cbd35>] ? sys_sendmsg+0x15d/0x1c5
[1182267.067682] [<c11cc65c>] ? sys_connect+0x7b/0xb2
[1182267.094435] [<c11cc66f>] ? sys_connect+0x8e/0xb2
[1182267.100119] [<c10437d6>] ? autoremove_wake_function+0x0/0x2d
[1182267.114698] [<c10b225a>] ? fsnotify_modify+0x5a/0x61
[1182267.131592] [<c11ccf0f>] ? sys_socketcall+0x171/0x1aa
[1182267.142669] [<c10b2f49>] ? sys_write+0x5b/0x63
[1182267.150459] [<c10030fb>] ? sysenter_do_call+0x12/0x28
Rick noticed and was able to get to it and get it rebooted by ...
$ hostname && (date -d "now - "$(awk '{print $1;}' < /proc/uptime)" seconds")
linuxmafia.com
Wed Dec 27 19:53:52 PST 2023
$
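For reference, the boot-time arithmetic in that one-liner (current time minus the first field of /proc/uptime) can be sketched as a small function. This is only an illustration: the name boot_epoch is made up here, and the final formatting step assumes GNU date.

```shell
#!/bin/sh
# boot_epoch NOW_EPOCH UPTIME_LINE   (hypothetical helper name)
# Print the boot time as seconds since the epoch, given the current epoch
# time and a line in /proc/uptime format ("<uptime seconds> <idle seconds>").
boot_epoch() {
    now=$1
    up=${2%% *}     # keep only the first field (uptime in seconds)
    up=${up%%.*}    # drop the fractional part; sh arithmetic is integer-only
    echo $(( now - up ))
}

# On a Linux host, format the result; "date -d @N" assumes GNU date.
if [ -r /proc/uptime ]; then
    date -d "@$(boot_epoch "$(date +%s)" "$(cat /proc/uptime)")"
fi
```

Dropping the fractional part can shift the result by up to one second relative to the original command, which does not matter for this purpose.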
today: (I) added watchdog:
2023-12-28
linuxmafia VM (this host), (virtual) hardware change, added:
<watchdog model='i6300esb'/>
That will give a (virtual) hardware watchdog device after the next
reboot (a full (virtual) power down of the VM may be required to activate
the change).
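After that power cycle, a quick check from inside the guest can confirm the device is really there. A minimal sketch, assuming the usual /dev/watchdog path and that the driver shows up under /sys/module as i6300esb once loaded; the function name is made up for illustration.

```shell
#!/bin/sh
# has_watchdog_device [PATH]   (hypothetical helper name)
# Succeed if PATH (default /dev/watchdog) exists and is a character device.
has_watchdog_device() {
    [ -c "${1:-/dev/watchdog}" ]
}

if has_watchdog_device; then
    echo "watchdog device present"
else
    echo "no watchdog device (driver not loaded, or VM not yet power cycled)"
fi

# A loaded kernel module gets a directory under /sys/module.
if [ -d /sys/module/i6300esb ]; then
    echo "i6300esb module loaded"
fi
```

Note that the device node generally only appears once the driver module is loaded, so a missing /dev/watchdog does not by itself prove that the virtual hardware is absent.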
installed watchdog version 5.12-1
Configured watchdog as follows, notably changing from the defaults to:
--- /etc/default/watchdog.default 2023-12-28 22:02:22.000000000 -0800
+++ /etc/default/watchdog 2023-12-28 22:11:09.000000000 -0800
@@ -4 +4 @@
-watchdog_module="none"
+watchdog_module=i6300esb
--- /etc/watchdog.conf.default 2012-04-05 03:16:33.000000000 -0700
+++ /etc/watchdog.conf 2023-12-28 22:10:12.000000000 -0800
@@ -23 +23 @@
-#watchdog-device = /dev/watchdog
+watchdog-device = /dev/watchdog
So, once a full (virtual) powerdown and reboot is done, the watchdog
then should be fully effective, and should, e.g. generally automagically
recover from kernel Oops (also tested on linuxmafia2 VM) with the
(virtual) hardware doing a (virtual) reset after 30s of non-response.
So, not exactly a fix, but after next (virtual) power down and reboot,
should be a pretty effective work-around to reduce downtime from kernel
Oops within the VM itself, and can likewise be configured to handle
other issues (e.g. load too high, or failure of specific check(s)).
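
As an illustration of the "specific check(s)" case: the watchdog daemon has a test-binary option in /etc/watchdog.conf that runs a program periodically. The sketch below is a hypothetical check script, not something configured on linuxmafia.com; it assumes the daemon treats a non-zero exit status as a failed check, and the probed directory and the WD_CHECK_DIR variable are made up for the example.

```shell
#!/bin/sh
# Hypothetical check script for watchdog's test-binary option.
# check_writable DIR: succeed only if a scratch file can be created in DIR.
check_writable() {
    tmp=$(mktemp "$1/.wd-check.XXXXXX" 2>/dev/null) || return 1
    rm -f "$tmp"
}

# The directory to probe is an assumption; pick one that matters on the host.
# The script's exit status is that of this last command.
check_writable "${WD_CHECK_DIR:-/tmp}"
```

For the load case, watchdog.conf also has max-load-1 (and similar) settings, so no script is needed there.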