[conspire] linuxmafia.com watchdog activated: Re: linuxmafia.com: yesterday: another kernel Oops, today: added watchdog

Michael Paoli Michael.Paoli at cal.berkeley.edu
Sun Jan 21 14:46:59 PST 2024


So, this past Tuesday 2024-01-16,
during the BALUG.org[1] meeting[2],
did a shutdown (and virtual power down) and
reboot of linuxmafia.com, thus activating the
watchdog timer that had been put in place and
configured earlier (the virtual power down was
needed for the Virtual Machine (VM) to apply the
hardware changes for running config, as that (virtual) hardware
change couldn't be hot added).

This generally should help with faster recovery / less downtime from
kernel Oops or the like: notably, if the host goes unresponsive for
more than 30 seconds, it gets a swift kick in the (virtual) reset
from the (virtual) watchdog timer hardware.
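
A quick way to confirm the (virtual) watchdog hardware is actually
visible after the reboot might look like the sketch below. It assumes a
kernel that exposes watchdog devices under /sys/class/watchdog (older
kernels may only provide /dev/watchdog); the function name
watchdog_report is made up for illustration.

```shell
#!/bin/sh
# watchdog_report [DIR]
# List watchdog devices found under a sysfs-style directory.
# DIR defaults to /sys/class/watchdog (assumption: the kernel exposes it).
watchdog_report() {
    dir=${1:-/sys/class/watchdog}
    found=0
    for wd in "$dir"/watchdog*; do
        # Skip the unexpanded glob when nothing matches.
        [ -d "$wd" ] || continue
        found=1
        name=${wd##*/}
        identity=unknown
        timeout=unknown
        # These attributes are only present if the driver provides them.
        [ -r "$wd/identity" ] && identity=$(cat "$wd/identity")
        [ -r "$wd/timeout" ] && timeout="$(cat "$wd/timeout")s"
        echo "$name: identity=$identity timeout=$timeout"
    done
    if [ "$found" -eq 0 ]; then
        echo "no watchdog device found under $dir"
        return 1
    fi
}
```

Running watchdog_report with no argument checks the live system; a
non-zero return means no device was found there.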

At that same BALUG.org meeting, also had a live demo of
watchdog reset (forced a kernel Oops on another VM, and watched it
do its thing, bringing the host up again in timely manner
automagically), and interesting discussions on watchdog and
troubleshooting/isolation of intermittent issues.
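
For anyone wanting to try a similar test on a scratch VM, one common way
to crash a Linux kernel on purpose is writing "c" to
/proc/sysrq-trigger as root. That is an assumption here, not
necessarily how the demo was done. The sketch below wraps it in a guard;
the function name and the confirmation string are made up for
illustration.

```shell
#!/bin/sh
# force_crash CONFIRM [TRIGGER]
# Writes "c" to the SysRq trigger file, which crashes the running kernel.
# CONFIRM must be the literal string yes-crash-this-vm (hypothetical guard).
# TRIGGER defaults to /proc/sysrq-trigger; writing there needs root.
force_crash() {
    confirm=$1
    trigger=${2:-/proc/sysrq-trigger}
    if [ "$confirm" != "yes-crash-this-vm" ]; then
        echo "refusing: pass yes-crash-this-vm to confirm" >&2
        return 2
    fi
    # Flush dirty data first; the crashed kernel will not do it.
    sync
    printf 'c\n' > "$trigger"
}
```

Only do this on a VM you can afford to lose state on. Whether the
kernel then hangs (leaving the reset to the watchdog) or reboots by
itself depends on its panic settings.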

1. https://www.balug.org/
2. https://lists.balug.org/pipermail/balug-announce/2024-January/000339.html

On 2023-12-28 22:45, Michael Paoli wrote:
> linuxmafia.com: So,:
> 
> yesterday: another kernel Oops:
> Message from syslogd at linuxmafia at Wed Dec 27 09:46:26 2023 ...
> linuxmafia kernel: [1182265.672051] Oops: 0000 [#1] SMP
> [1182267.004666]  [<c11cbbc0>] ? sock_sendmsg+0x96/0xae
> [1182267.014521]  [<c10437d6>] ? autoremove_wake_function+0x0/0x2d
> [1182267.029253]  [<c11f3b94>] ? ip_route_output_flow+0x71/0x1ac
> [1182267.038502]  [<c113a993>] ? copy_from_user+0x27/0x10e
> [1182267.048242]  [<c11d2f4c>] ? verify_iovec+0x3e/0x6e
> [1182267.057662]  [<c11cbd35>] ? sys_sendmsg+0x15d/0x1c5
> [1182267.067682]  [<c11cc65c>] ? sys_connect+0x7b/0xb2
> [1182267.094435]  [<c11cc66f>] ? sys_connect+0x8e/0xb2
> [1182267.100119]  [<c10437d6>] ? autoremove_wake_function+0x0/0x2d
> [1182267.114698]  [<c10b225a>] ? fsnotify_modify+0x5a/0x61
> [1182267.131592]  [<c11ccf0f>] ? sys_socketcall+0x171/0x1aa
> [1182267.142669]  [<c10b2f49>] ? sys_write+0x5b/0x63
> [1182267.150459]  [<c10030fb>] ? sysenter_do_call+0x12/0x28
> 
> Rick noticed and was able to get to it and get it rebooted by ...
> $ hostname && (date -d "now - "$(awk '{print $1;}' < /proc/uptime)" seconds")
> linuxmafia.com
> Wed Dec 27 19:53:52 PST 2023
> $
> 
> today: (I) added watchdog:
> 2023-12-28
> linuxmafia VM (this host), (virtual) hardware change, added:
> <watchdog model='i6300esb'/>
> That will give (virtual) hardware watchdog device after next
> reboot (may require full (virtual) power down of VM to activate change).
> installed watchdog version 5.12-1
> Configured watchdog as follows, notably changing from the defaults to:
> --- /etc/default/watchdog.default       2023-12-28 22:02:22.000000000 -0800
> +++ /etc/default/watchdog       2023-12-28 22:11:09.000000000 -0800
> @@ -4 +4 @@
> -watchdog_module="none"
> +watchdog_module=i6300esb
> --- /etc/watchdog.conf.default  2012-04-05 03:16:33.000000000 -0700
> +++ /etc/watchdog.conf  2023-12-28 22:10:12.000000000 -0800
> @@ -23 +23 @@
> -#watchdog-device       = /dev/watchdog
> +watchdog-device        = /dev/watchdog
> So, once a full (virtual) powerdown and reboot is done, the watchdog
> then should be fully effective, and should, e.g. generally automagically
> recover from kernel Oops (also tested on linuxmafia2 VM) with the
> (virtual) hardware doing a (virtual) reset after 30s of non-response.
> 
> So, not exactly a fix, but after next (virtual) power down and reboot,
> should be a pretty effective work-around to reduce downtime from kernel
> Oops within the VM itself, and can likewise be configured to handle
> other issues (e.g. load too high, or failure of specific check(s)).
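
As a rough illustration of the "other issues" mentioned above, a few
additional /etc/watchdog.conf settings might look like the fragment
below. The option names come from watchdog.conf(5); every value is an
illustrative assumption, not what is configured on linuxmafia.com, and
192.0.2.1 is a documentation-only address.

```
# Illustrative values only; tune for the host.
# Reboot if the 1-minute load average exceeds this.
max-load-1 = 24
# Reboot if this address stops answering pings (placeholder address).
ping       = 192.0.2.1
# Reboot if this file has not changed within "change" seconds.
file       = /var/log/syslog
change     = 1800
```

Each extra check is another way for the host to get reset, so it is
probably worth adding them one at a time and watching the logs.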


