I’m using TX2 4GB modules with L4T 32.4.4. On some of the modules (around 5%), the system occasionally hangs and the watchdog is triggered. This is highly irregular: it usually happens once every 1-5 days, but sometimes it takes even longer.
So far I have not found any correlation between stability and system load, temperature, the connections in use, etc. The simplest case I have tried is to install the module in the NVIDIA carrier board, plug in only HDMI, a USB keyboard, Ethernet, the serial port and DC input, and then just wait.
This leads me to believe that it could be a HW issue, as most modules work fine. I now have two of them that have been running continuously for over 2 months without issues in real use (custom carrier board, meaningful CPU/GPU/IO load).
Even if this does turn out to be a HW fault in some modules rather than, for example, a kernel bug, having to test each module for a week or more is a major inconvenience. I would also expect that, if the HW is not faulty, a module in the NVIDIA carrier board running vanilla L4T 32.4.4 would never trigger the watchdog (I also tried L4T 32.5.1).
Any hints on how to debug this? I haven’t been able to get anything useful out of the Tegra over the serial port: the system just suddenly restarts with PMC reset reason = watchdog. Next I was going to look into the possibility of debugging over JTAG with the NVIDIA carrier board.
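For the load/temperature correlation check mentioned above, a minimal sketch of a heartbeat logger is shown here. It is an illustration only, not something taken from L4T: the function names, the log file name and the 5 second interval are hypothetical. It assumes the generic Linux interfaces `/proc/loadavg` and `/sys/class/thermal/thermal_zone*/temp` (millidegrees Celsius) are present on the module. Each record is fsynced so that the last line written before a watchdog reset has a chance of surviving on disk.

```python
import glob
import os
import time


def read_text(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def sample(thermal_glob="/sys/class/thermal/thermal_zone*/temp",
           loadavg_path="/proc/loadavg"):
    """Take one snapshot: (timestamp, 1-minute load average, zone temps in C)."""
    temps = []
    for path in sorted(glob.glob(thermal_glob)):
        raw = read_text(path)
        try:
            # sysfs reports thermal zone temperatures in millidegrees Celsius.
            temps.append(int(raw) / 1000.0)
        except (TypeError, ValueError):
            continue  # zone unreadable or not numeric; skip it
    load = read_text(loadavg_path)
    load1 = load.split()[0] if load else "nan"
    return time.time(), load1, temps


def format_record(ts, load1, temps):
    """Format one snapshot as a single log line."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(ts))
    temps_str = ",".join("{:.1f}".format(t) for t in temps)
    return "{} load1={} temps_c={}".format(stamp, load1, temps_str)


def append_record(log_path, line):
    """Append a line and force it to disk before returning."""
    with open(log_path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())


def main(log_path="heartbeat.log", interval_s=5.0):
    """Log a snapshot every interval_s seconds until the process is killed."""
    while True:
        append_record(log_path, format_record(*sample()))
        time.sleep(interval_s)
```

Calling `main()` starts the loop. After a watchdog reset, the timestamp of the last record gives a rough time of the hang, and the final few records show load and temperatures leading up to it. This will not show why the system hung, only what it looked like shortly before.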