Debugging TX2 4GB reliability/watchdog issues

I’m using a TX2 4GB with L4T 32.4.4, and on some of the modules (around 5%) the system occasionally hangs and the watchdog is triggered. This is highly irregular: it usually happens once every 1-5 days, but sometimes it takes even longer.

So far I have not found any correlation between stability and system load, temperature, the connections used, etc. The simplest case that I have tried is to install the module into the nvidia carrier board, plug in only HDMI, a USB keyboard, ethernet, the serial port and DC input, and then just wait.

This leads me to believe that it could be a HW issue, as most modules work fine and I now have two of them that have been running over 2 months continuously without issues in real use (custom carrier board, meaningful CPU/GPU/IO load).

Even if this does turn out to be a HW fault in some modules and not, for example, a kernel bug, it’s a major inconvenience to run additional testing on each module for a week or more. I would also expect that if the HW is not faulty, using the nvidia carrier board with vanilla L4T 32.4.4 would not eventually cause the watchdog to be triggered (I did also try L4T 32.5.1).

Any hints on how to debug this? I haven’t been able to get anything useful out of the Tegra with the serial port - the system just suddenly starts again with PMC reset reason = watchdog. Next I was going to look into the possibility of debugging over JTAG with the nvidia carrier board.

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

hello aki.reijonen,

I have also set up a TX2 remotely and have never experienced this.
Do you have any SW modifications, or are you working with the default JetPack release image?
thanks

Hi,

I do have all kinds of modifications, but the devices that hang do so in all combinations that I have tested, including the nvidia carrier board with the default image (tried L4T 32.4.4 and an OTA update from that to 32.5.1).

hello aki.reijonen,

Please reproduce the issue on l4t-r32.5.1 with the Jetson TX2 DevKit,
and gather the kernel logs for reference, for example, $ dmesg --follow
thanks
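
Since the hang ends in a watchdog reset, anything that `dmesg --follow` printed only to a terminal is gone after the reboot. One possible way to keep it is to write every line to a file and force it to disk as it arrives. The following is a minimal sketch under assumptions, not an NVIDIA tool: the helper names `persist_lines` and `follow_dmesg` and the log path are hypothetical, and it assumes Python 3 is available on the target. A hard hang can still lose the last lines, and there is no guarantee the kernel logs anything before the watchdog fires.

```python
import os
import subprocess
import time


def persist_lines(lines, path):
    """Append each line to `path` and force it to disk before reading the next.

    Returns the number of lines written. A wall-clock timestamp is prepended
    so the last line before a reset shows roughly when the system stopped.
    """
    count = 0
    with open(path, "a", encoding="utf-8") as out:
        for line in lines:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            out.write("{} {}\n".format(stamp, line.rstrip()))
            out.flush()              # push Python's buffer to the OS
            os.fsync(out.fileno())   # ask the OS to write it to storage
            count += 1
    return count


def follow_dmesg(path):
    """Run `dmesg --follow` and persist its output line by line (hypothetical helper)."""
    proc = subprocess.Popen(
        ["dmesg", "--follow"],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # text mode; also works on older Python 3
        bufsize=1,                # line-buffered pipe
    )
    try:
        return persist_lines(proc.stdout, path)
    finally:
        proc.terminate()
```

Calling `follow_dmesg("dmesg-follow.log")` (the file name is only an example) blocks until interrupted or until the system resets. After the reboot, the tail of the file holds the last kernel messages that reached storage.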