Watchdog Timeout

Hello,

This may be a trivial question, but how do I configure the watchdog timeout on the AGX Dev Kit?

I have a remote dev kit running the latest version of JetPack deep in the woods, tracking deer. At times the machine runs out of memory and becomes unreachable, and I usually have to travel all the way into the woods to push the reset button, which can be annoying.

I thought about purchasing one of these Programmable Powerstrips, but it seems like overkill.

Can I configure the watchdog to automatically reset the Jetson when it runs out of memory and/or becomes unreachable? If so, what should I do?

The watchdog function is currently enabled by default with a timeout of 120 s; if the system hangs, it will reset the system. However, in the OOM case that would reset the system.

Hello Shane,

Are there any configuration options for the watchdog? The default settings are not resetting my Jetson once it becomes unresponsive.

Side question: does overcommitting memory affect the watchdog? I currently have a 16 GB swap file, and I overcommit the memory. I’ve noticed the system becomes unresponsive if both the RAM and the swap file become full.
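
For reference, one way to see how close the system is to exhausting both RAM and swap is to read /proc/meminfo. The sketch below is only an illustration: the function name mem_headroom_kb is made up for this example, and it assumes the kernel reports the MemAvailable and SwapFree fields.

```shell
# Hypothetical helper (name invented for this example): print the sum of
# MemAvailable and SwapFree, in kB, from a meminfo-style file.
mem_headroom_kb() {
    meminfo="${1:-/proc/meminfo}"   # defaults to the live kernel report
    awk '
        # Add up the two fields that describe memory still usable
        $1 == "MemAvailable:" || $1 == "SwapFree:" { total += $2 }
        END { print total + 0 }
    ' "$meminfo"
}
```

Logging the output of mem_headroom_kb periodically would show whether the hang coincides with the headroom approaching zero.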

Try the commands below to verify the watchdog reset function.
In your case, it could be that the system did not actually die, so the WDT was never triggered to reset it.

root@t186_int:/proc/sys/kernel # echo 0 > panic
root@t186_int:/proc/sys/kernel # cat panic
	0

Then crash the system using:
	echo c > /proc/sysrq-trigger
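
Before running the test, it can help to record the kernel settings that decide what happens on a panic or an OOM. The following is a minimal sketch: show_panic_settings is a name made up here, while the three sysctl files it reads are the standard Linux ones.

```shell
# Hypothetical helper (name invented for this example): print the
# panic- and OOM-related sysctl values found under a /proc/sys-style root.
show_panic_settings() {
    root="${1:-/proc/sys}"
    for key in kernel/panic vm/panic_on_oom vm/overcommit_memory; do
        if [ -r "$root/$key" ]; then
            printf '%s = %s\n' "$key" "$(cat "$root/$key")"
        else
            # The file may be missing on a stripped-down kernel
            printf '%s = (unavailable)\n' "$key"
        fi
    done
}
```

A kernel/panic value of 0 means the kernel does not reboot by itself after a panic, which is why the test above leaves the reset to the watchdog. A vm/panic_on_oom value of 0 means an OOM invokes the OOM killer rather than a panic, and vm/overcommit_memory selects the overcommit policy (0 heuristic, 1 always, 2 never).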

I have run those commands.

The device freezes and resets (as expected). However, it then boots into emergency mode.

I am then unable to SSH into it. As before, I need to go press the reset button to connect.

Is this normal reset behavior for the watchdog?

What’s your version?

cat /etc/nv_tegra_release

Here is the output:

R32 (release), REVISION: 6.1, GCID: 27863751, BOARD: t186ref, EABI: aarch64, DATE: Mon Jul 26 19:36:31 UTC 2021
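
When comparing boards, it may be handy to reduce that line to a single version string. This is a small sketch, assuming the line keeps the "R<major> (release), REVISION: <minor>," layout shown above; l4t_version is a name invented for this example.

```shell
# Hypothetical helper (name invented for this example): read a release
# line on stdin and print "<major>.<revision>".
l4t_version() {
    # The leading "#" is optional: some files store the line as a comment
    sed -n 's/^#\{0,1\} *R\([0-9][0-9]*\) (release), REVISION: \([0-9.][0-9.]*\),.*/\1.\2/p' |
        head -n 1
}
```

It could be fed with the first line of the file, for example: head -n 1 /etc/nv_tegra_release | l4t_version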

Is there any customization on your system?
I just confirmed that my AGX has no problem with the same BSP.

I just tested the same command on a Jetson Xavier NX and the same issue occurs. It also has the same BSP as my AGX.

The only thing in common between the two is a 500 GB NVMe M.2 PCIe SSD; the swap file is located on the NVMe SSD on both devices.

I noticed that during the boot process (after the watchdog reboot) they both failed an NVMe check and failed to mount the local file system.

Could this be the cause of the emergency mode?

Okay, so I can confirm the problem is with either the NVMe SSD or the swap file.

I have a second Jetson NX that I just tested. It has neither the SSD nor a swap file, and it was able to reboot without going into emergency mode.

I have the same SSD on my NX and AGX.

It is an SP M.2 PCIe Gen 3 SSD, model A80 512GB.

Update:

I flashed a spare AGX to the newest version of JetPack, added a Samsung NVMe SSD, then ran the command:

echo c > /proc/sysrq-trigger

The result is emergency mode.

Oddly, on all three machines I can reboot from the command line without issues. So it has something to do with the watchdog restarting the system while an NVMe SSD is mounted.
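
One possible explanation, not confirmed anywhere in this thread: on systemd-based systems, an /etc/fstab entry that fails to mount and does not carry the nofail option sends the boot into emergency mode. The sketch below lists such entries so they can be reviewed; fstab_without_nofail is a name made up for this example, and whether adding nofail is appropriate for a given mount is a judgment call.

```shell
# Hypothetical helper (name invented for this example): list fstab
# entries, other than the root filesystem, that lack the "nofail" option.
fstab_without_nofail() {
    fstab="${1:-/etc/fstab}"
    awk '
        /^[ \t]*(#|$)/ { next }                    # skip comments and blank lines
        $2 == "/"      { next }                    # the root filesystem has to mount anyway
        $4 !~ /(^|,)nofail(,|$)/ { print $1, $2 }  # device and mount point
    ' "$fstab"
}
```

If the NVMe partition or the swap file on it shows up in the output, a failed NVMe check after an unclean reset could be enough to stop the boot.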

Write an application that pings the watchdog every 100 seconds.

cat /proc/watchdog

Once you do that, it is userspace's responsibility to keep reloading the watchdog counter every 100 seconds.
If userspace hangs, the system will reset automatically.
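
A rough sketch of such an application is shown here as a shell function. It assumes the watchdog is exposed through the standard Linux device node /dev/watchdog, where opening the node hands the timer to userspace and any write counts as a ping; that is an assumption, since the reply above names /proc/watchdog. The function name feed_watchdog and its arguments are made up for this example.

```shell
# Hypothetical helper (name and arguments invented for this example).
# Usage: feed_watchdog DEVICE INTERVAL_SECONDS [MAX_PINGS]
# MAX_PINGS of 0 (the default) means ping forever.
feed_watchdog() {
    dev="$1"
    interval="$2"
    max_pings="${3:-0}"

    [ -w "$dev" ] || { echo "cannot write to $dev" >&2; return 1; }

    # Keep the device open on descriptor 3 for the whole loop, so the
    # watchdog stays under the control of this process.
    exec 3>"$dev" || return 1

    n=0
    while :; do
        printf '.' >&3 || return 1   # any write reloads the counter
        n=$((n + 1))
        if [ "$max_pings" -gt 0 ] && [ "$n" -ge "$max_pings" ]; then
            break
        fi
        sleep "$interval"
    done

    # What happens on close depends on the driver (magic close, nowayout).
    exec 3>&-
}
```

Run as root, feed_watchdog /dev/watchdog 100 would ping every 100 seconds, which leaves some margin under the 120 s timeout mentioned earlier. If the loop stops being scheduled because userspace has hung, the pings stop and the timer is left to expire.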
