tegra21x watchdog module

Hello,

I’m using the Yocto layer to build my image for the Jetson Nano, and whenever the tegra_wdt_t21x module is loaded the board resets. This appears to be because the watchdog is not being kicked.

I’ve set the watchdog timeout in systemd to 120 seconds, and the log shows the Tegra module acknowledging it, but the board still resets after about 20 seconds. This doesn’t happen with the Ubuntu SD-card image, so I’m guessing it could be due to a difference in how systemd pings /dev/watchdog. Or maybe a separate application pings the watchdog in the Ubuntu image?

Could you please tell me more about what the root cause could be? Thank you.

[   11.397267] tegra-wdt 60005100.watchdog: Tegra WDT enabled on probe. Timeout = 120 seconds.
[   11.418422] tegra-wdt 60005100.watchdog: initialized (timeout = 120 sec, nowayout = 0)

[   36.330274] Watchdog detected hard LOCKUP on cpu 0
[   36.334940] ------------[ cut here ]------------
[   36.339595] WARNING: CPU: 3 PID: 0 at /usr/src/kernel/kernel/watchdog_hld.c:143 watchdog_check_hardlockup_other_cpu+0x108/0x128
[   36.351118] Modules linked in: ip6_tables ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink br_netfilter xt_owner nvgpu nvs tegra_wdt_t21x
[   36.365387] 
[   36.366885] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.9.140-l4t-r32.1+g0e2f66e #1
[   36.374573] Hardware name: jetson-nano (DT)
[   36.378773] task: ffffffc0fa7bc600 task.stack: ffffffc0fa7d0000
[   36.384718] PC is at watchdog_check_hardlockup_other_cpu+0x108/0x128
[   36.391098] LR is at watchdog_check_hardlockup_other_cpu+0x108/0x128
[   36.397476] pc : [<ffffff800817c878>] lr : [<ffffff800817c878>] pstate: 604001c5
[   36.404900] sp : ffffffc0ff019d00
[   36.408225] x29: ffffffc0ff019d00 x28: ffffffc0fa7bc600 
[   36.413581] x27: 0000000000000000 x26: 0000000000000000 
[   36.418936] x25: ffffff800a218000 x24: ffffff800a218fa8 
[   36.424289] x23: ffffff800a51f268 x22: ffffffc0ff01d6f8 
[   36.429640] x21: 0000000000000000 x20: ffffff800a219c30 
[   36.434993] x19: ffffff8009bed760 x18: ffffffffffffffff 
[   36.440347] x17: 000000000000569f x16: 0000000000000000 
[   36.445700] x15: ffffff800a217e10 x14: ffffffc17f0196c7 
[   36.451053] x13: ffffffc0ff0196ca x12: ffffffc0f95e4b60 
[   36.456406] x11: ffffff8008fe15d8 x10: 00000000ffffffff 
[   36.461760] x9 : 00000000000002f6 x8 : 206e6f2050554b43 
[   36.467113] x7 : 4f4c206472616820 x6 : ffffffc0ff019700 
[   36.472466] x5 : 0000000000000012 x4 : 0000000000000000 
[   36.477818] x3 : 0000000000000000 x2 : 000000000004098c 
[   36.483172] x1 : 0000000000000000 x0 : 0000000000000026 
[   36.488525] 
[   36.490021] ---[ end trace 5c17e2a8d4f30cf2 ]---
[   36.494658] Call trace:
[   36.497116] [<ffffff800817c878>] watchdog_check_hardlockup_other_cpu+0x108/0x128
[   36.504542] [<ffffff800817b9ec>] watchdog_timer_fn+0x9c/0x280
[   36.510314] [<ffffff800813e764>] __hrtimer_run_queues+0xe4/0x378

Could you check the code below on your system?
Please run it as root (sudo).

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

int main(void) {
	int fd, ret;
	int timeout = 0;

	/* Open the WDT0 device (WDT0 enables itself automatically) */
	fd = open("/dev/watchdog0", O_RDWR);
	if (fd < 0) {
		fprintf(stderr, "Open watchdog device failed: %s\n", strerror(errno));
		return 1;
	}

	/* WDT0 is counting now; check the default timeout value */
	ret = ioctl(fd, WDIOC_GETTIMEOUT, &timeout);
	if (ret) {
		fprintf(stderr, "Get watchdog timeout value failed: %s\n", strerror(errno));
		return 1;
	}
	fprintf(stdout, "Watchdog timeout value: %d\n", timeout);

	/* Set a new timeout value of 60 s */
	/* Note: the value should be within [5, 1000] */
	timeout = 60;
	ret = ioctl(fd, WDIOC_SETTIMEOUT, &timeout);
	if (ret) {
		fprintf(stderr, "Set watchdog timeout value failed: %s\n", strerror(errno));
		return 1;
	}
	fprintf(stdout, "New watchdog timeout value: %d\n", timeout);

	/* Kick WDT0; a real daemon should do this periodically */
	ret = ioctl(fd, WDIOC_KEEPALIVE, NULL);
	if (ret) {
		fprintf(stderr, "Kick watchdog failed: %s\n", strerror(errno));
		return 1;
	}

	/* The device is closed implicitly when the process exits */
	return 0;
}

Thank you ShaneCCC,

I think I figured out how it works. The module internally registers a handler that kicks the watchdog from a kernel thread. It keeps doing this until userspace opens the device /dev/watchdog.

If a userspace daemon closes the device after a ping, even with a magic close, the watchdog assumes the daemon died and reboots the board immediately, without waiting for the timeout to expire.
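
For anyone unfamiliar with the term, "magic close" in the generic Linux watchdog API means writing the character 'V' to the device right before closing it, which asks a driver that supports it to stop the watchdog instead of letting it fire. Below is a minimal sketch of that convention; the helper name watchdog_magic_close is made up for illustration, and as noted above this module appeared to reset the board anyway, so treat it as a description of the mechanism rather than a fix.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of any library): perform a "magic close"
 * on an already-open watchdog descriptor by writing 'V' immediately
 * before close(). Returns 0 on success, -1 if the write or close fails.
 */
int watchdog_magic_close(int fd)
{
	const char magic = 'V';
	ssize_t n;

	/* Retry the one-byte write if a signal interrupts it. */
	do {
		n = write(fd, &magic, 1);
	} while (n < 0 && errno == EINTR);

	if (n != 1) {
		perror("magic close write");
		close(fd);
		return -1;
	}
	return close(fd) == 0 ? 0 : -1;
}
```

It would be called with the descriptor returned by open("/dev/watchdog0", O_WRONLY), in place of a plain close(fd).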

In my case, using systemd to set the timeout to X seconds implies that systemd opens, and probably also closes, the device, so the module disabled its internal keepalive and then performed a reset.
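
Since merely opening /dev/watchdog changes the module's behaviour, it may be safer to inspect the configured timeout without touching the device node. The sketch below reads an integer from a sysfs attribute. It assumes the kernel was built with CONFIG_WATCHDOG_SYSFS so that a file such as /sys/class/watchdog/watchdog0/timeout exists; I have not confirmed that for this L4T kernel, and read_int_attr is a hypothetical helper name.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: read one decimal integer from a text file such as
 * /sys/class/watchdog/watchdog0/timeout (assumed path, present only with
 * CONFIG_WATCHDOG_SYSFS). Stores the value in *out and returns 0 on
 * success, or returns -1 on any error.
 */
int read_int_attr(const char *path, long *out)
{
	char buf[64];
	char *end;
	long value;
	FILE *f = fopen(path, "r");

	if (f == NULL)
		return -1;
	if (fgets(buf, sizeof(buf), f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* Reject empty or non-numeric content. */
	errno = 0;
	value = strtol(buf, &end, 10);
	if (errno != 0 || end == buf)
		return -1;

	*out = value;
	return 0;
}
```

Calling read_int_attr("/sys/class/watchdog/watchdog0/timeout", &t) before and after systemd starts would show whether the timeout was really changed, without opening the watchdog device from the test program.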

Shane’s example is very good for the case where a userspace daemon is needed. The Ubuntu SD-card image does not have any userspace daemon that opens /dev/watchdog, so the module does the keepalive internally.
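
If a userspace daemon is the way to go, the single kick in Shane's example has to become a loop that keeps the device open for the lifetime of the process. Here is a rough sketch of that loop using the same WDIOC_KEEPALIVE ioctl. The helper names kick_interval and keepalive_rounds are hypothetical, and kicking at half the timeout is just a conservative convention I am assuming, not something this driver requires.

```c
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

/* Assumed convention: kick at half the timeout, but never less than 1 s. */
unsigned int kick_interval(unsigned int timeout_sec)
{
	unsigned int half = timeout_sec / 2;

	return half > 0 ? half : 1;
}

/*
 * Hypothetical helper: send `rounds` keepalives on an open watchdog
 * descriptor, sleeping interval_sec between consecutive kicks.
 * Returns the number of kicks sent, or -1 as soon as one ioctl fails.
 * A real daemon would loop indefinitely and never close fd.
 */
int keepalive_rounds(int fd, unsigned int interval_sec, int rounds)
{
	int sent = 0;

	while (sent < rounds) {
		if (ioctl(fd, WDIOC_KEEPALIVE, NULL) != 0)
			return -1;
		sent++;
		if (sent < rounds)
			sleep(interval_sec);
	}
	return sent;
}
```

With the 120 second timeout from my setup, kick_interval(120) gives 60, so the daemon would open /dev/watchdog0 once and then kick every 60 seconds for as long as it runs.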