Jetson Orin AGX freezing in the field, needs to be power-cycled

JetPack version: 5.1.1-b56
nvidia-l4t-core 35.3.1-20230319081403
Jetson Orin AGX Developer Kit: 32 GB or 64 GB
The device froze and had to be power-cycled. Are there any kernel-level or watchdog solutions to prevent this from happening?

Hi himica.khurana,

Do you know the reason for the freeze?

A watchdog should be enabled by default.
Please share the full dmesg for further checking.

We are not aware of the reason. Are there any commands you would recommend for debugging?

dmesg-logs.txt (145.6 KB)

I want to check the dmesg from boot-up.

You can also run the following command to check if the watchdog is enabled.

$ cat /proc/device-tree/watchdog@2190000/status

I would also suggest that you try to find the reason for the freeze.

This just says okay.

cat /proc/device-tree/watchdog@2190000/timeout-sec
Response:
x
Does x mean it is not configured?

It should be configured as 120s in device tree by default.

timeout-sec = <120>;

Please share the dmesg to confirm whether it is initialized correctly.
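
A note on the x: properties under /proc/device-tree are stored as raw big-endian binary cells, not as text. The value 120 is 0x78, which is the ASCII code for the letter x, so cat printing x is consistent with timeout-sec = <120>. A minimal sketch for decoding the value as a number follows; dt_u32 is a hypothetical helper name, and it assumes the property is a single 32-bit cell.

```shell
# Decode one 32-bit big-endian device-tree cell into a decimal number.
# dt_u32 is an illustrative helper, not an NVIDIA tool.
dt_u32() {
    # od prints the first four bytes as unsigned decimals, e.g. "0 0 0 120"
    set -- $(od -An -v -tu1 -N4 "$1")
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Path taken from the thread; only read it if it exists on this machine.
prop=/proc/device-tree/watchdog@2190000/timeout-sec
if [ -r "$prop" ]; then
    dt_u32 "$prop"
fi
```

If this prints 120, the timeout is already set to the default and nothing needs to be configured.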

How do I configure it to be 120? Can you share the command?
When I run cat /proc/device-tree/watchdog@2190000/timeout-sec
it prints x.

Here are the dmesg logs.

You'll need to attach the logs again; they don't seem to be present. Incidentally, the logs from just before the system goes down are the most important, although the logs from the reboot would also be useful. You might get something with ssh or another login, but serial console would give more information more reliably (I suppose in the field you can't leave another computer running, but even something as simple as an RPi or another Jetson can log from serial console).

I don't see your dmesg file here. Maybe the upload failed; please share it again.

dmesg-logs.txt (145.6 KB)

Uploaded again. Thank you.

[1740680.142843] device vethe553490 left promiscuous mode
[1740680.142849] br-1c451ff65de3: port 1(vethe553490) entered disabled state
[1742020.609411] hot-surface-alert cooling state: 1 -> 0

I would like to check the log from boot-up.
Please run sudo reboot and share the dmesg again.

I also noticed that the temperature seems to be high in your case. Could that be the cause of the freezing?
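
To see how hot the board is running over ssh, the standard Linux thermal sysfs entries can be read directly. This is only a sketch: print_thermal_zones is a hypothetical helper name, and which zones exist and what they are called depends on the board and the L4T release.

```shell
# Print each thermal zone's name and temperature in degrees Celsius.
# The optional first argument overrides the sysfs base directory.
print_thermal_zones() {
    base=${1:-/sys/class/thermal}
    for zone in "$base"/thermal_zone*; do
        # Skip if the glob matched nothing or the zone cannot be read
        [ -r "$zone/temp" ] || continue
        milli=$(cat "$zone/temp" 2>/dev/null) || continue
        name=$(cat "$zone/type" 2>/dev/null) || name=${zone##*/}
        # sysfs reports millidegrees Celsius; integer division drops the fraction
        echo "$name: $((milli / 1000)) C"
    done
}

print_thermal_zones
```

Running this periodically and logging the output to a file could help show whether the freezes line up with temperature peaks.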

It is in the field. Is it worth trying to reboot over ssh?
Yes, the devices are in high-temperature areas.

Do you recommend rebooting the devices at intervals?

Yes, we need to check the dmesg from boot-up to confirm whether the watchdog has been enabled correctly on your board.

We need to see the dmesg from boot-up? I can't do that over ssh, right?

You can just run sudo dmesg in your ssh session and share the log as a file here.
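
As a rough sketch, the log can be saved to a file and then filtered for watchdog messages. wdt_lines is a hypothetical helper name, and the search pattern is only a guess at what the watchdog driver prints, so share the full file in any case.

```shell
# Print only watchdog-related lines from a saved kernel log file.
# The pattern is an assumption, not a documented message format.
wdt_lines() {
    grep -i -E 'watchdog|wdt' "$1"
}

# Typical use over ssh, shortly after a reboot so the boot messages are still present:
#   sudo dmesg > dmesg-full.txt
#   wdt_lines dmesg-full.txt
```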