Device stuck after several weeks, watchdog

Hello everyone,

I have a device (Orin Nano 8GB on a carrier board from Seeed) that ran fine for several weeks but recently got completely stuck after these errors:

...
Jan 25 06:00:10 qt163 ModemManager[782]: <info>  [base-manager] couldn't check support for device '/sys/devices/platform/140a0000.pcie/pci0008:00/0008:00:00.0/0008:01:00.0': not supported by any plugin
Jan 25 06:00:10 qt163 ModemManager[782]: <info>  [base-manager] couldn't check support for device '/sys/devices/platform/14100000.pcie/pci0001:00/0001:00:00.0/0001:01:00.0': not supported by any plugin
Jan 25 06:00:10 qt163 ModemManager[782]: <info>  [base-manager] couldn't check support for device '/sys/devices/platform/3550000.xudc': not supported by any plugin
Jan 25 06:00:10 qt163 kernel: [   16.651828] nvvrs_pseq 4-003c: CAUTION: interrupt status reg:0x10 set to 0x8
Jan 25 06:00:10 qt163 kernel: [   16.651837] nvvrs_pseq 4-003c: Clearing interrupts
Feb  2 11:31:37 qt163 systemd[1]: Started /etc/rc.local Compatibility.
...

I have a reboot scheduled at 6:00 am every day to avoid such cases, but the device remains stuck until someone unplugs the power and plugs it back in.

I found out the board has a watchdog, and the documentation says:

Watchdog framework support	Registration with WDT framework
System reset on CPU hang	System reset on WDT expiry
Suspend/resume support	Suspend/resume handling
Watchdog interrupt support	WDT reset on ISR
Watchdog polling/ping support	WDT start/stop/ping from user space

$ file /dev/watchdog
/dev/watchdog: character special (10/130)
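
To illustrate the last row (ping from user space), here is a minimal Python sketch. It assumes the driver behind /dev/watchdog follows the standard Linux watchdog character-device semantics: any write counts as a ping, and writing the character V right before closing asks the driver to disarm the timer ("magic close"). The name pet_watchdog, the is_healthy callback, and the 10 s interval are hypothetical and not from the NVIDIA documentation. Whether the Tegra driver honours magic close, and how user-space pings interact with any pinging the kernel does on its own, is not confirmed here.

```python
import time


def pet_watchdog(device_path, is_healthy, interval_s=10.0, max_pings=None):
    """Ping the watchdog device while is_healthy() keeps returning True.

    Returns True after a clean stop (magic close written), or False if the
    health check failed and the device was closed without disarming it.
    max_pings=None means "run until the health check fails".
    """
    pings = 0
    # Unbuffered binary mode so every write reaches the driver immediately.
    with open(device_path, "wb", buffering=0) as wdt:
        while max_pings is None or pings < max_pings:
            if not is_healthy():
                # Leave without the magic character: the timer keeps
                # running and should reset the board when it expires.
                return False
            wdt.write(b"\0")  # any write counts as a ping
            pings += 1
            time.sleep(interval_s)
        # Magic close: ask the driver to disarm the watchdog before closing.
        wdt.write(b"V")
    return True
```

It would be run as root from a small service, for example pet_watchdog("/dev/watchdog", app_is_alive), where app_is_alive is whatever liveness check fits the application. If the check fails or the process itself hangs, the pings stop and the timer should expire, so the interval has to stay well below the watchdog timeout.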

Is this all just a software watchdog, so that if something happens at the hardware level there is nothing the device can do to recover by itself?

Full log attached
syslog.txt (1.2 MB)

Hi nox.gias,

Which JetPack version are you using?

It seems you are using a custom board for Orin Nano.
Do you run any application on your board before it gets errors?

Could you boot your board now?
Have you tried to re-flash the board and check if it could get recovered?

Hi Kevin,

JetPack 5.1.1

It seems you are using a custom board for Orin Nano.

Yes, it's a carrier board from Seeed (p3509-a02+p3767-0000).

Do you run any application on your board before it gets errors?

Yes, there is a Python application running. I don't think the problem is in the app, as all the other devices are fine.

Could you boot your board now?

Yes, it's fine after a manual restart. The problem is the restart itself: once the device is delivered to a client, it's no longer possible to restart it manually.

Have you tried to re-flash the board and check if it could get recovered?

No, the device is not bricked and starts working again after a manual restart. I am looking for a solution so that when/if it happens next time, the device can recover by itself (restart) and carry on. Is this a hardware problem?

Do you mean that only one board hit this issue?
If so, how about the fail rate?

Couldn’t your client perform the manual restart for your product?

How did you perform the reboot everyday?

Where did you get these messages?

Do you mean that only one board hit this issue?
If so, how about the fail rate?

Yeah, this is the second time it has happened in a year, across ~100 devices in total (not many, but it would be great to resolve it before we expand to ~500).

Couldn’t your client perform the manual restart for your product?

Unfortunately no (we tried). The best solution we came up with is using smart sockets for remote restart, but it is a big problem to get them online in the client's protected network (plus security reasons). So I am looking for a solution on the board itself.

How did you perform the reboot everyday?

Just a cron job: 0 6 * * * shutdown -r now

Where did you get these messages?

Under the LSIO section (not sure what that means, Google didn't help):
https://docs.nvidia.com/jetson/archives/r35.4.1/DeveloperGuide/text/SO/JetsonOrinSeries.html#lsio

Have you tried performing a reboot stress test to reproduce and debug the issue?

Jan 25 06:00:07 qt163 kernel: [    1.541074] tegra_wdt_t18x 2190000.watchdog: Tegra WDT init timeout = 120 sec
Jan 25 06:00:07 qt163 kernel: [    1.548230] tegra_wdt_t18x 2190000.watchdog: Registered successfully

The watchdog is enabled on your board by default and will trigger after the 120 s timeout.

Does your “stuck” issue happen when you run shutdown -r now at 6 am? Or does it happen occasionally?

Have you tried performing a reboot stress test to reproduce and debug the issue?

No, how do I do that? I remember there is a tool that loads the CPU to 100%; is it something like that? Should we do it on each device to catch any hardware problems?

The watchdog is enabled on your board by default and will trigger after the 120 s timeout.

It looks like it comes from the CPU itself. Does that mean that if the CPU hits an internal fatal error for any reason, the watchdog will not help?

Does your “stuck” issue happen when you run shutdown -r now at 6 am? Or does it happen occasionally?

This time it happened right after the morning restart. The last time is hard to recall (it was 6 months ago); I think there was a kernel panic in the middle of the day with a lot of @^@^@^@ in the logs, and a restart resolved the issue.

A reboot stress test helps you find potential issues that are hard to reproduce and improve the reliability of your product.
You could just write a script that executes the shutdown -r now command and have it run automatically after boot-up.
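
The script idea above could be sketched in Python roughly as follows. This is only a sketch under assumptions: the function name run_cycle, the boot counter file, the max_boots limit, and the delay before rebooting are hypothetical additions; only the shutdown -r now command comes from this thread. The counter gives the loop an end, and the delay leaves a window to log in and stop the test.

```python
import subprocess
import time
from pathlib import Path


def run_cycle(counter_path, max_boots, delay_s=60,
              runner=subprocess.run, sleeper=time.sleep):
    """Record this boot and reboot again unless max_boots has been reached.

    Returns (boot_count, rebooted). runner and sleeper are injectable so the
    logic can be exercised without actually rebooting the machine.
    """
    path = Path(counter_path)
    try:
        count = int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        count = 0  # first boot of the test, or an unreadable counter
    count += 1
    path.write_text(f"{count}\n")

    if count >= max_boots:
        return count, False  # final boot: leave the system running

    # Give the system time to settle (and an operator time to intervene).
    sleeper(delay_s)
    runner(["shutdown", "-r", "now"], check=True)
    return count, True
```

It would be called once per boot, for example from /etc/rc.local or a systemd unit, with the counter file on persistent storage. If the counter stops increasing, the value it stopped at tells you on which cycle the board got stuck.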