Could not trigger watchdog hard reset under Guest OS

juxue.tang1 · June 2, 2023, 1:53am

We tried to trigger a watchdog hard reset for testing purposes, but nothing happened after we opened /dev/watchdog0, either

with the command:
echo 1 | sudo tee /dev/watchdog0

or with C code using ioctl:

#include <stdio.h>
#include <fcntl.h>          /* open() */
#include <unistd.h>         /* sleep(), close() */
#include <sys/ioctl.h>      /* ioctl() */
#include <linux/watchdog.h> /* WDIOC_* requests */

int main(void) {

    int fd, ret;
    int timeout = 0;

    /* Open the WDT0 device (WDT0 enables itself automatically). */
    fd = open("/dev/watchdog0", O_RDWR);
    if (fd < 0) {
        fprintf(stderr, "Open watchdog device failed!\n");
        return -1;
    }

    /* WDT0 is counting now; check the default timeout value. */
    ret = ioctl(fd, WDIOC_GETTIMEOUT, &timeout);
    if (ret) {
        fprintf(stderr, "Get watchdog timeout value failed!\n");
        return -1;
    }
    fprintf(stdout, "Watchdog timeout value: %d\n", timeout);

    /* Set a new timeout value of 20 s. */
    /* Note: the value should be within [5, 1000]. */
    timeout = 20;
    ret = ioctl(fd, WDIOC_SETTIMEOUT, &timeout);
    if (ret) {
        fprintf(stderr, "Set watchdog timeout value failed!\n");
        return -1;
    }
    fprintf(stdout, "New watchdog timeout value: %d\n", timeout);

    /* Kick WDT0 once; a real application would do this periodically. */
    ret = ioctl(fd, WDIOC_KEEPALIVE, NULL);
    if (ret) {
        fprintf(stderr, "Kick watchdog failed!\n");
        return -1;
    }
    sleep(1);

    /* Close the WDT0 device and check the result of close() itself. */
    ret = close(fd);
    if (ret < 0) {
        fprintf(stderr, "Failed to close watchdog device.\n");
        return -1;
    }

    return 0;

}

The C code above successfully set the timeout to a new value (102 to 20):

Jan 31 10:19:23 tegra-ubuntu sudo[3417]: nvidia : TTY=pts/0 ; PWD=/home/nvidia ; USER=root ; COMMAND=./wdt
Jan 31 10:19:23 tegra-ubuntu kernel: tegra_wdt_t18x 2190000.watchdog: Watchdog(0): wdt timeout set to 20 sec
Jan 31 10:19:39 tegra-ubuntu kernel: watchdog: watchdog0: watchdog did not stop!

The kernel messages suggest that tegra_wdt_t18x started successfully but tegra_hv_wdt is not working:

Jan 29 08:42:46 tegra-ubuntu kernel: tegra_wdt_t18x 2190000.watchdog: shutdown timeout disabled
Jan 29 08:42:46 tegra-ubuntu kernel: tegra_wdt_t18x 2190000.watchdog: Tegra WDT init timeout = 120 sec
Jan 29 08:42:46 tegra-ubuntu kernel: tegra_wdt_t18x 2190000.watchdog: Registered successfully
Jan 29 08:42:48 tegra-ubuntu kernel: tegra_hv_wdt tegra_hv_wdt: failed to find ivc property
Jan 29 08:42:48 tegra-ubuntu kernel: tegra_hv_wdt tegra_hv_wdt: failed to parse device tree

However, after waiting for the timeout to elapse, nothing happened: no system reset was triggered.
Is another process, service, or driver kicking the watchdog at the same time? Or are other VMs kicking the same watchdog?
Is there another way to trigger the hard reset?
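To rule out the test program itself as the reason the timer never expires, one option is to keep the device open and simply never send a keepalive. The following is only a minimal sketch of that idea: the helper names wdt_arm and wdt_starve are made up for illustration, WDIOC_GETTIMELEFT is optional in the Linux watchdog API and may not be supported by this driver, and whether an expiry actually resets the system under the hypervisor is exactly the open question here.

```c
#include <fcntl.h>
#include <linux/watchdog.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: open the watchdog device and program a timeout.
 * Returns the open file descriptor, or -1 on error. */
int wdt_arm(const char *dev, int timeout_s)
{
    int fd = open(dev, O_WRONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout_s) < 0) {
        perror("WDIOC_SETTIMEOUT");
        close(fd);
        return -1;
    }
    printf("timeout now %d s\n", timeout_s);
    return fd;
}

/* Hypothetical helper: hold the device open for wait_s seconds WITHOUT
 * sending any keepalive. Prints the remaining time once per second when
 * the driver reports it. Returns how many reads of the time left worked. */
int wdt_starve(int fd, int wait_s)
{
    int ok_reads = 0;
    for (int i = 0; i < wait_s; i++) {
        int left = 0;
        if (ioctl(fd, WDIOC_GETTIMELEFT, &left) == 0) {
            printf("time left: %d s\n", left);
            ok_reads++;
        }
        sleep(1);
    }
    return ok_reads;
}
```

A main() run as root could call wdt_arm("/dev/watchdog0", 20) and then wdt_starve(fd, 40). If the board is still up after that, the missing reset is not caused by this process kicking or closing the device.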

Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.6
DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)
DRIVE OS 6.0.4 SDK
other

Target Operating System
Linux
QNX
other

Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure its number)
other

SDK Manager Version
1.9.2.10884
other

Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other

VickNV · June 2, 2023, 3:26am

Please refer to "How to Identify System Part Number (P/N)" and share the P/N of your devkit with us.

juxue.tang1 · June 2, 2023, 3:57am

The part number is P/N 940-63710-0010-200.

juxue.tang1 · June 2, 2023, 5:19am

I tried reading the raw WDT0 watchdog timer status register (WDTSR) directly with the command:

sudo busybox devmem 0x02190004 32

Before I started the watchdog, WDTSR read 0x00000011:
which is b0000 0000 0000 0000 0000 0000 0001 0001
According to the register table, this indicates that the watchdog is enabled (bit 0) and the current countdown (bits 11:4) is 1.

After I opened the watchdog, the value changed and finally settled at 0x0001501F:
which is b0000 0000 0000 0001 0101 0000 0001 1111
It appears that Local Interrupt (bit 1), Local FIQ (bit 2), and Remote Interrupt (bit 3) have all been triggered,
and Current Expiration Count (bits 14:12) has already reached 5, which is SYSTEM_RESET.
Current Error (bit 16) has also become 1. What does this mean? Maybe the hard reset is disabled or bypassed?
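For reference, the decoding above can be reproduced mechanically. This is a small sketch that only encodes the bit positions quoted in this post, so the field layout is an assumption rather than a copy of the register table. In particular, Remote Interrupt is assumed to sit at bit 3, because the low nibble of 0x0001501F is 0xF. The struct and function names are made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* WDTSR fields as described in this thread (assumed layout). */
struct wdtsr_fields {
    unsigned enabled;          /* bit 0 */
    unsigned local_irq;        /* bit 1 */
    unsigned local_fiq;        /* bit 2 */
    unsigned remote_irq;       /* bit 3 (assumed position of Remote Interrupt) */
    unsigned current_count;    /* bits 11:4 */
    unsigned expiration_count; /* bits 14:12 */
    unsigned current_error;    /* bit 16 */
};

/* Split a raw 32-bit WDTSR value into its fields. */
struct wdtsr_fields decode_wdtsr(uint32_t raw)
{
    struct wdtsr_fields f;
    f.enabled          = (raw >> 0)  & 0x1u;
    f.local_irq        = (raw >> 1)  & 0x1u;
    f.local_fiq        = (raw >> 2)  & 0x1u;
    f.remote_irq       = (raw >> 3)  & 0x1u;
    f.current_count    = (raw >> 4)  & 0xFFu;
    f.expiration_count = (raw >> 12) & 0x7u;
    f.current_error    = (raw >> 16) & 0x1u;
    return f;
}

/* Print one decoded value in a single line. */
void print_wdtsr(uint32_t raw)
{
    struct wdtsr_fields f = decode_wdtsr(raw);
    printf("0x%08X: en=%u lirq=%u lfiq=%u rirq=%u count=%u exp=%u err=%u\n",
           (unsigned)raw, f.enabled, f.local_irq, f.local_fiq, f.remote_irq,
           f.current_count, f.expiration_count, f.current_error);
}
```

Calling print_wdtsr(0x00000011) and print_wdtsr(0x0001501F) gives the two readings from this post in field form, which makes it easier to compare further devmem reads taken while the timer is running.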

VickNV · June 2, 2023, 10:22pm

Are you working with the DRIVE OS SDK or PDK? Additionally, I’m curious to know where you obtained the register information you mentioned. It would be helpful to understand the context and source of these details.

juxue.tang1 · June 6, 2023, 4:38am

I am currently working with the SDK. I'm not sure about the source of the register information, as it was given to me by someone else.

VickNV · June 6, 2023, 5:12am

Does your company have any PDK access or nvonline access?
Please provide more information about the person who shared the register information with you through a personal message, if applicable. This will help us understand the context and provide you with better support. Thank you.

juxue.tang1 · June 6, 2023, 6:34am

The register information comes from the “Technical Reference Manual, NVIDIA Orin Series System-on-Chip” document (Version: ALPHA, Date: 21-MAY-2021, ID: DP-10508-001).

VickNV · June 6, 2023, 3:50pm

It seems that your company has nvonline access. Have you tried getting support there?

system · July 11, 2023, 2:19am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.