TX2 Watchdog Functionality

Hi All,

I have a couple of questions about the watchdog functionality (L4T 28.2). First, I was experimenting with the watchdog. My understanding is that running:

sudo tail -f /dev/watchdog

should enable the watchdog, which eventually times out and restarts the system. However, when I run this command (as the nvidia user) I get the following output:

tail: error reading '/dev/watchdog': Invalid argument
tail: '/dev/watchdog' has become accessible
tail: /dev/watchdog: cannot seek to offset 0: Illegal seek

Any thoughts on this?

Second question: We have installed several TX2 units in an industrial application. In general they work well for extended periods of time; however, occasionally we find that they are simply off. Power to the unit is still there, but we cannot SSH in or reach our local webserver, and after a power cycle there are no local logs from the time they were off. Our theory is that the site power is not always consistent and at some point the unit browns out. We are looking at adding a UPS to prevent this, but I am wondering if the watchdog can be used to help in this circumstance?

Cheers
Ian

You will find some useful docs in the kernel source. Within the kernel, look for:

Documentation/watchdog/watchdog-api.txt

The kernel source also provides a sample program; just copy it to the TX2 and build it:

Documentation/watchdog/src/*
# Compile:
make watchdog-simple
sudo ./watchdog-simple

There may be more restrictions or some changes between the older 3.x kernels and the newer 4.x kernels, but I haven’t actually checked what changed or when.
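For anyone who wants the same idea without building the C sample, here is a minimal Python sketch of a keepalive loop in the spirit of watchdog-simple: open the device for writing and write to it at a fixed interval. The function name `pet_watchdog`, the 10-second interval, and the `max_pets` and `sleep` parameters are assumptions added for illustration and testing; they are not part of the kernel sample. The interval must be shorter than the timeout of whatever watchdog driver is in use.

```python
import os
import time


def pet_watchdog(path="/dev/watchdog", interval_s=10.0, max_pets=None,
                 sleep=time.sleep):
    """Write to the watchdog device periodically (hypothetical helper).

    path       -- device node; /dev/watchdog is the name used in this thread
    interval_s -- assumed interval, must be shorter than the driver timeout
    max_pets   -- stop after this many writes (None means loop forever)
    sleep      -- injectable sleep function, handy for dry runs
    """
    # The device is opened for writing only; opening it is what arms the timer.
    fd = os.open(path, os.O_WRONLY)
    pets = 0
    try:
        while max_pets is None or pets < max_pets:
            os.write(fd, b"\0")  # any write counts as a keepalive
            pets += 1
            sleep(interval_s)
    finally:
        os.close(fd)
    return pets
```

To use it on the TX2, call `pet_watchdog()` from a script run with sudo. If the script stops writing (for example because it is killed or the system hangs), the expectation is that the watchdog times out and resets the board, but what happens when the device is closed depends on the driver and kernel configuration, so check watchdog-api.txt for your kernel.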

Thanks, linuxdev. This lets me restart via the watchdog. Any thoughts on the second question above? Is there any mechanism in place to allow recovery from brown-out conditions (i.e., conditions where the OS isn’t actually running, but there is power to the system)?

Cheers

The usual recipe for brownout is “don’t let it happen”. You could have custom hardware to monitor the line condition and act as an extension to the regular watchdog software. Perhaps you could load some area in memory with alternating patterns (e.g., 0xaa, then 0x55, then 0x00, then 0xff), and if the memory read back a few seconds later does not match, treat that as a reason not to send the watchdog “stop reset” (the keepalive), so that the watchdog reboots the system. Memory is perhaps the part of the system most sensitive to brownouts. But if the system is locked, or parts of it have crashed, then this wouldn’t help anyway; you’d need external hardware that accepts a heartbeat from the Jetson and forces a reboot when the heartbeat stops.
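A rough Python sketch of the pattern check described above, for illustration only. The names `PATTERNS`, `fill`, `matches`, and `memory_check`, the 4096-byte buffer, and the 2-second hold time are all assumptions. A buffer allocated by the interpreter like this may well be served from CPU cache rather than DRAM, so treat it as a picture of the logic, not as a real memory test.

```python
import time

# Alternating bit patterns mentioned in the post above.
PATTERNS = (0xAA, 0x55, 0x00, 0xFF)


def fill(buf, value):
    """Overwrite every byte of buf with value."""
    buf[:] = bytes([value]) * len(buf)


def matches(buf, value):
    """Return True if every byte of buf still equals value."""
    return buf == bytes([value]) * len(buf)


def memory_check(size=4096, hold_s=2.0, sleep=time.sleep):
    """Write each pattern, wait, and read it back (hypothetical helper).

    Returns False as soon as a read-back differs from what was written.
    """
    buf = bytearray(size)
    for value in PATTERNS:
        fill(buf, value)
        sleep(hold_s)  # give a brownout a chance to corrupt the data
        if not matches(buf, value):
            return False
    return True
```

The idea would be to call `memory_check()` before each watchdog keepalive and skip the keepalive when it returns False, so the watchdog is left to reset the system.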

If it is really critical, then you should consider external hardware to monitor heartbeat.
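As a sketch of what the external side might do, the decision logic is small: remember when the last heartbeat arrived and request a power cycle once too much time has passed. The class name `HeartbeatMonitor`, its methods, and the timeout value are hypothetical; how the heartbeat is delivered (GPIO, serial, network) and how the power is actually cut are left out because they depend entirely on the hardware chosen.

```python
class HeartbeatMonitor:
    """Timeout logic for an external heartbeat watchdog (hypothetical)."""

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s  # how long silence is tolerated
        self.last_beat = now        # time of the most recent heartbeat

    def beat(self, now):
        """Record a heartbeat received from the Jetson at time now."""
        self.last_beat = now

    def should_power_cycle(self, now):
        """Return True once the heartbeat has been silent for too long."""
        return (now - self.last_beat) > self.timeout_s
```

Times are passed in explicitly so the same logic can be driven by any clock source, for example a microcontroller tick counter on the monitoring hardware.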

Thanks again for the valuable input. Based on our conditions, I think the path forward is adding a UPS (following the strategy of ‘don’t let it happen’) and also looking into an external hardware watchdog controlling a relay on the input power.

Cheers