I have a device (Orin Nano 8GB on a carrier board from Seeed) which ran fine for several weeks but recently got completely stuck after the following errors:
...
Jan 25 06:00:10 qt163 ModemManager[782]: <info> [base-manager] couldn't check support for device '/sys/devices/platform/140a0000.pcie/pci0008:00/0008:00:00.0/0008:01:00.0': not supported by any plugin
Jan 25 06:00:10 qt163 ModemManager[782]: <info> [base-manager] couldn't check support for device '/sys/devices/platform/14100000.pcie/pci0001:00/0001:00:00.0/0001:01:00.0': not supported by any plugin
Jan 25 06:00:10 qt163 ModemManager[782]: <info> [base-manager] couldn't check support for device '/sys/devices/platform/3550000.xudc': not supported by any plugin
Jan 25 06:00:10 qt163 kernel: [ 16.651828] nvvrs_pseq 4-003c: CAUTION: interrupt status reg:0x10 set to 0x8
Jan 25 06:00:10 qt163 kernel: [ 16.651837] nvvrs_pseq 4-003c: Clearing interrupts
Feb 2 11:31:37 qt163 systemd[1]: Started /etc/rc.local Compatibility.
...
I have a scheduled reboot at 6:00 am every day to avoid such cases, but the device remains stuck until someone unplugs the power and plugs it back in.
I found out that I have a watchdog, and the documentation lists these features:
Watchdog framework support: Registration with WDT framework
System reset on CPU hang: System reset on WDT expiry
Suspend/resume support: Suspend/resume handling
Watchdog interrupt support: WDT reset on ISR
Watchdog polling/ping support: WDT start/stop/ping from user space
$ file /dev/watchdog
/dev/watchdog: character special (10/130)
Is this all just a software watchdog, so that if something happens at the hardware level there is nothing that can be done for self-recovery?
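For reference, the "polling/ping support" row above corresponds to the standard Linux watchdog character-device interface: opening the device arms the timer, any write pings it, and writing `V` just before closing (the "magic close") asks the driver to disarm it. Below is a minimal sketch of how an application could tie its own health to that timer. The names `pet_watchdog`, `healthy`, and `max_pings` are made up for illustration, and it assumes nothing else holds `/dev/watchdog` open. If another process does (for example systemd with `RuntimeWatchdogSec` set), `open()` fails with "Device or resource busy".

```python
import time


def pet_watchdog(path="/dev/watchdog", interval_s=10.0,
                 healthy=lambda: True, max_pings=None):
    """Ping the watchdog while healthy() is true; return the number of pings.

    If healthy() turns false, the function returns WITHOUT the magic close,
    so the timer keeps running and the board resets when it expires.
    """
    pings = 0
    # Opening the device arms the timer. Unbuffered, so each write
    # reaches the driver immediately.
    with open(path, "wb", buffering=0) as wdt:
        while max_pings is None or pings < max_pings:
            if not healthy():
                return pings  # leave the timer armed on purpose
            wdt.write(b"\0")  # any write counts as a ping
            pings += 1
            time.sleep(interval_s)
        # Magic close: ask the driver to disarm on a clean exit. This only
        # works if the driver supports it and "nowayout" is not set.
        wdt.write(b"V")
    return pings

# Hypothetical usage inside the application (needs root):
#   pet_watchdog(interval_s=10.0, healthy=my_app_is_alive)
```

The ping interval has to stay well below the watchdog timeout. This only covers hangs that stop the pinging process; whether the reset still fires when the SoC itself is wedged depends on the hardware timer, which this sketch cannot answer.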
It seems you are using a custom board for Orin Nano.
Yes, it's a carrier board from Seeed (p3509-a02+p3767-0000).
Do you run any application on your board before it gets errors?
Yes, there is a Python application running. I don't think the problem is in the app, as all the other devices are fine.
Could you boot your board now?
Yes, it's fine after a manual restart. The problem is the manual restart itself: once the device is delivered to the client, it is no longer possible to restart it manually.
Have you tried to re-flash the board and check if it could get recovered?
No, the device is not bricked and starts working again after a manual restart. I am looking for a solution so that if it happens again, the device can recover by itself (restart) and carry on. Is this a hardware problem?
Do you mean that only one board hit this issue?
If so, what is the failure rate?
Yeah, this is the second time it has happened within the year, across ~100 devices in total (not so many, but it would be great to resolve it before expanding to ~500).
Couldn’t your client perform the manual restart for your product?
Unfortunately no (we tried). The best solution we came up with is using smart sockets for remote restart, but it is a big problem to get them online in the client's protected network (plus security reasons). So I am looking for a solution on the board itself.
Have you tried to perform reboot stress test to reproduce and debug the issue?
No, how do I do that? I remember there is a tool which loads the CPU to 100%, something like that? Should we run it on each device to catch any hardware problem?
The watchdog is enabled on your board by default and is triggered after a 120 s timeout.
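One way to check the timeout on a given unit without opening `/dev/watchdog` (opening it can arm the timer) is to read the watchdog attributes from sysfs. This is only a sketch: it assumes the kernel exposes `/sys/class/watchdog/watchdog0` (that is, it was built with `CONFIG_WATCHDOG_SYSFS`), which may not hold on every L4T release. The function name `read_watchdog_info` is made up for illustration.

```python
from pathlib import Path

# Attribute names from the Linux watchdog sysfs ABI. Some may be missing
# or unreadable, depending on the driver.
ATTRS = ("identity", "state", "timeout", "timeleft", "nowayout", "bootstatus")


def read_watchdog_info(base="/sys/class/watchdog/watchdog0"):
    """Return a dict of the watchdog attributes that could be read."""
    info = {}
    for name in ATTRS:
        try:
            info[name] = (Path(base) / name).read_text().strip()
        except OSError:
            # Attribute absent or not supported by this driver: skip it.
            continue
    return info

# Hypothetical usage:
#   print(read_watchdog_info())
```

If the directory does not exist, the function returns an empty dict, which says nothing about whether the watchdog itself is enabled.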
It looks like it comes from the CPU itself. Does that mean that if the CPU experiences an internal fatal error for any reason, it will not help?
Is your “stuck” issue happening when you run shutdown -r now at 6 am? Or does it happen occasionally?
This time it was right after the morning restart. The last time is hard to recall (it was 6 months ago); I think it was in the middle of the day, with a kernel panic and a lot of @^@^@^@ in the logs. A restart resolved the issue.
A reboot stress test helps you find potential issues that are hard to reproduce, and so improve the reliability of your product.
You could just write a script that executes the shutdown -r now command and have this script run automatically after boot.
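A rough sketch of such a script in Python, since the application on the board is already Python. The counter path, the cycle limit, and the settle delay are placeholders, not recommended values, and the function name `reboot_cycle` is made up for illustration. It would need to run as root at every boot, for example from /etc/rc.local or a systemd unit.

```python
import subprocess
import time
from pathlib import Path


def reboot_cycle(counter_file="/var/lib/reboot-stress/count",
                 max_cycles=500, settle_s=60, reboot=None):
    """Record this boot and trigger the next reboot; return the boot count."""
    path = Path(counter_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Read the previous count (0 if the file is missing or empty).
    previous = 0
    if path.exists():
        previous = int(path.read_text().strip() or 0)
    count = previous + 1
    path.write_text(f"{count}\n")
    if count >= max_cycles:
        return count  # test finished: stay up
    # Give the system time to finish booting (and time to log in and
    # stop the test) before rebooting again.
    time.sleep(settle_s)
    if reboot is None:
        subprocess.run(["shutdown", "-r", "now"], check=True)
    else:
        reboot()  # injected for dry runs
    return count

# Hypothetical usage at boot:
#   reboot_cycle()
```

If a unit hangs during the test, the last value in the counter file shows how many cycles it survived. Deleting the counter file resets the test.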