Jetson AGX Xavier freezing intermittently

I am having issues with an AGX that freezes intermittently. The front LED turns off, but the unit can be restarted with the power button. It is a little hard to troubleshoot, which is why I am here. None of the logs I have checked report anything particularly unusual, except that they seem to be badly terminated, with a long run of “^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@” at the end. I am unable to find anything out of order after rebooting. CPU/GPU temperatures never go far above 60 °C, which I think should be okay.

I am unsure how to troubleshoot this further. I’ll be happy to supply logs, configs or anything else needed to get to the bottom of this. Thank you.
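
For reference, the “^@” characters are how many viewers display NUL bytes. A run of them at the end of a log is commonly seen after an abrupt power loss, when the file had grown but its data had not yet reached the disk. A minimal Python sketch for checking how many NUL bytes a log ends with follows; the function name `count_trailing_nuls` and the chunk size are illustrative choices, not part of any Jetson tooling.

```python
import os


def count_trailing_nuls(path, chunk_size=4096):
    """Return the number of consecutive NUL bytes at the end of a file."""
    count = 0
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        # Walk backwards through the file one chunk at a time.
        while pos > 0:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            chunk = f.read(step)
            stripped = chunk.rstrip(b"\x00")
            count += len(chunk) - len(stripped)
            if stripped:
                # Found a non-NUL byte, so the trailing run ends here.
                break
    return count


if __name__ == "__main__":
    import sys

    # Example: python3 check_nuls.py /var/log/syslog.1
    for log_path in sys.argv[1:]:
        print(log_path, count_trailing_nuls(log_path))
```

A non-zero count on the log that was being written at the time of the freeze would be consistent with the unit losing power rather than shutting down cleanly, although it does not prove it.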

Is this an AGX devkit or a custom board?

Which JetPack release version is that?

AGX devkit. The JetPack version is a little weird:

$ apt-cache show nvidia-jetpack
Package: nvidia-jetpack
Version: 4.6-b199
Architecture: arm64
Maintainer: NVIDIA Corporation
Installed-Size: 194
Depends: nvidia-cuda (= 4.6-b199), nvidia-opencv (= 4.6-b199), nvidia-cudnn8 (= 4.6-b199), nvidia-tensorrt (= 4.6-b199), nvidia-visionworks (= 4.6-b199), nvidia-container (= 4.6-b199), nvidia-vpi (= 4.6-b199), nvidia-l4t-jetson-multimedia-api (>> 32.6-0), nvidia-l4t-jetson-multimedia-api (<< 32.7-0)
Homepage: http://developer.nvidia.com/jetson
Priority: standard
Section: metapackages
Filename: pool/main/n/nvidia-jetpack/nvidia-jetpack_4.6-b199_arm64.deb
Size: 29376
SHA256: d67b85293cade45d81dcafebd46c70a97a0b0d1379ca48aaa79d70d8ba99ddf8
SHA1: 74d9cbdfe9af52baa667e321749b9771101fc6de
MD5sum: cd1b3a0b651cd214b15fa76f6b5af2cd
Description: NVIDIA Jetpack Meta Package
Description-md5: ad1462289bdbc54909ae109d1d32c0a8

Package: nvidia-jetpack
Version: 4.6-b197
Architecture: arm64
Maintainer: NVIDIA Corporation
Installed-Size: 194
Depends: nvidia-cuda (= 4.6-b197), nvidia-opencv (= 4.6-b197), nvidia-cudnn8 (= 4.6-b197), nvidia-tensorrt (= 4.6-b197), nvidia-visionworks (= 4.6-b197), nvidia-container (= 4.6-b197), nvidia-vpi (= 4.6-b197), nvidia-l4t-jetson-multimedia-api (>> 32.6-0), nvidia-l4t-jetson-multimedia-api (<< 32.7-0)
Homepage: http://developer.nvidia.com/jetson
Priority: standard
Section: metapackages
Filename: pool/main/n/nvidia-jetpack/nvidia-jetpack_4.6-b197_arm64.deb
Size: 29372
SHA256: acec83ad0c1ef05caf9b8ccc6a975c4fb2a7f7830cbe63bbcf7b196a6c1f350e
SHA1: 3e11456cf0ec6b3a40d81b80ca1e14cebafa65ff
MD5sum: 72b2b7b280793bd4abdabe0d38b08535
Description: NVIDIA Jetpack Meta Package
Description-md5: ad1462289bdbc54909ae109d1d32c0a8
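
Two records in this output are not necessarily a problem: `apt-cache show` prints one record for every version of the package known to the configured repositories, not only the installed one (`apt-cache policy nvidia-jetpack` reports which one is installed). As a rough sketch, the versions can be pulled out of such output like this; `parse_versions` and the shortened `SAMPLE` text are illustrative assumptions.

```python
def parse_versions(apt_cache_output, package="nvidia-jetpack"):
    """Collect the Version field of every record for the given package."""
    versions = []
    # Records are separated by blank lines.
    for stanza in apt_cache_output.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            if line.startswith(" "):
                continue  # continuation of a multi-line field
            key, sep, value = line.partition(": ")
            if sep:
                fields[key] = value
        if fields.get("Package") == package and "Version" in fields:
            versions.append(fields["Version"])
    return versions


# Shortened copy of the output pasted above (only a few fields kept).
SAMPLE = """Package: nvidia-jetpack
Version: 4.6-b199
Architecture: arm64

Package: nvidia-jetpack
Version: 4.6-b197
Architecture: arm64
"""

if __name__ == "__main__":
    print(parse_versions(SAMPLE))
```

Both records here belong to the 4.6 release and depend on the L4T 32.6 multimedia API packages, so they point at the same JetPack release with different build numbers.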

Are you using a clean setup? I mean only the software installed by JetPack/SDK Manager.

Are you using a UART console to check the log, or just syslog and dmesg?

Is there a specific use case or application that triggers this issue?
Do you have other Jetson Xavier devices on which you can run the same use case to validate it?

Not a clean setup. A few things are plugged in, with a full ROS2 suite running inside a docker image on the Jetson while freezing. I cannot find any logs from inside the software that indicate an error, which is why I came here. I checked /var/log/syslog and the other log files in the same folder. Plugging a micro USB cable into the Jetson and then triggering a freeze produces no output.

I am unable to test with another Jetson at the moment, but I will be able to soon. Perhaps I can get back to you with more details then. In the meantime, I am happy to receive further tips on how to troubleshoot this.

edit: Serial console just printed this right before crashing:
[ 315.185774] nr_pdflush_threads exported in /proc is scheduled for removal
What gives?

Hi,

full ROS2 suite running inside a docker image on the Jetson while freezing

  1. We have no idea about this. Maybe you can try a clean setup with a similar application running first and see whether you can reproduce the issue.

  2. syslog may not be able to capture the error when the system crashes or freezes. If this is a devkit, use the UART console to monitor the log. Also, if the CPU hangs for more than a few minutes, the watchdog timer should reboot the device.

  3. Please share the full log instead of a partial log. Your error log seems to indicate that it takes about 5 minutes to reproduce the issue. Is it always this fast?
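
The 5 minute estimate comes from the kernel timestamp in square brackets, which counts seconds since boot. A small sketch for converting it is shown below; the helper names `seconds_since_boot` and `format_uptime` are made up for this example.

```python
import re

# Kernel log lines start with "[ seconds.microseconds]" since boot.
KERNEL_TS = re.compile(r"^\[\s*(\d+\.\d+)\]")


def seconds_since_boot(line):
    """Return the kernel timestamp of a log line, or None if it has none."""
    match = KERNEL_TS.match(line)
    return float(match.group(1)) if match else None


def format_uptime(seconds):
    """Render a number of seconds as whole minutes and seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes} min {secs} s"


if __name__ == "__main__":
    last_line = "[ 315.185774] nr_pdflush_threads exported in /proc is scheduled for removal"
    print(format_uptime(seconds_since_boot(last_line)))  # 5 min 15 s
```

Comparing the timestamp of the last line across several freezes gives a quick picture of how much the time to failure varies.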

Yes, I will not ask you to debug my software stuff. I very much appreciate getting advice on how to approach an issue like this.

Full log here. (zerobin.net). The final log bit I pasted is more or less all I get except for boot stuff. Issue can reliably be triggered, but the precise time to failure varies.

It did reboot one time, but most of the time, the front LED goes black and the jetson does not power on.

Can we get a more detailed description of the board's status when the error happens?

You mention that the “front LED goes black”. Are you saying that the power indicator of the AGX goes off when this error happens?
If so, it indicates that power is lost. In that case, it is likely to be a hardware issue such as unstable power.

If this were a software problem, the most common behavior would be that the log spews some panic messages and the device then reboots.
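
One way to apply this distinction to a captured UART log is to look for typical kernel crash messages near its end. The sketch below is only a heuristic: the marker list is an assumption based on common Linux kernel messages, is not exhaustive, and `find_panic_lines` is a made-up helper name.

```python
# Substrings that commonly appear when the Linux kernel crashes.
# This list is an assumption for illustration, not an official or complete set.
PANIC_MARKERS = (
    "kernel panic",
    "oops",
    "unable to handle kernel",
    "call trace:",
)


def find_panic_lines(lines):
    """Return the log lines that contain any of the crash markers."""
    hits = []
    for line in lines:
        lowered = line.lower()
        if any(marker in lowered for marker in PANIC_MARKERS):
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    import sys

    # Example: python3 scan_log.py uart_capture.log
    with open(sys.argv[1], errors="replace") as log:
        matches = find_panic_lines(log)
    if matches:
        print("Possible software crash:")
        print("\n".join(matches))
    else:
        print("No panic markers found; the log simply stops.")
```

A log that stops without any such lines, together with the power LED going off, fits the power-loss explanation better than a kernel crash.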

Unstable power seems like a good candidate! I will continue troubleshooting the other bits of the system. Thanks for helping me diagnose this issue.