My GPU is stuck with the IRQ/79-gk20a_st hang signal

Description

“I have about 2000 Jetson Nano devices (NVIDIA Tegra X1) running AI models for camera devices. After about two years of operation, around 20 of them have started to hit GPU hangs. After a device has been running for a while, my firmware can no longer load the AI model onto the GPU, and I also see that some interrupt threads, such as IRQ/79-gk20a_st, are hung as well. I have tried everything I could to kill or restart them, to no avail. The only thing that works is rebooting the device, but the issue comes back after just 1-2 days. Could you provide a solution?”

Environment

root@ubuntu:/home/nano# jetson_release
Software part of jetson-stats 4.2.12 - (c) 2024, Raffaello Bonghi
Model: NVIDIA Jetson Nano Developer Kit - Jetpack 4.6.5 [L4T 32.7.5]
NV Power Mode[0]: MAXN
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:
  • P-Number: p3448-0002
  • Module: NVIDIA Jetson Nano module (16Gb eMMC)
Platform:
  • Distribution: Ubuntu 18.04 Bionic Beaver
  • Release: 4.9.337-tegra
jtop:
  • Version: 4.2.12
  • Service: Active
Libraries:
  • CUDA: 10.2.300
  • cuDNN: 8.2.1.32
  • TensorRT: 8.2.1.9
  • VPI: 1.2.3
  • Vulkan: 1.2.70
  • OpenCV: 4.1.1 - with CUDA: NO

Hello,

Thanks for visiting the NVIDIA Developer forums! Your topic will be best served in the Jetson category.

I will move this post over for visibility.

Cheers,
Tom

I don’t see my post in the Jetson category.

Could you provide the full log for investigation?

Which log do you want to see? I don’t know what to do with it yet.
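In case it helps with the question above: a full kernel log (the output of dmesg) captured on an affected device, before rebooting it, is the usual thing to attach for a GPU hang. The sketch below is only an illustration of how the GPU-related lines could be pulled out of that log. The function names and the keyword pattern (gk20a, nvgpu) are my own assumptions, not something requested in this thread, and the filtered lines are no substitute for the full log.

```python
import re
import subprocess

# Assumed keywords: substrings that GPU driver messages are likely to contain.
GPU_PATTERN = re.compile(r"gk20a|nvgpu", re.IGNORECASE)


def filter_gpu_lines(log_text):
    """Return the log lines that match GPU_PATTERN, in their original order."""
    return [line for line in log_text.splitlines() if GPU_PATTERN.search(line)]


def read_kernel_log():
    """Capture the kernel ring buffer with dmesg (may require root)."""
    # stdout=PIPE plus universal_newlines keeps this compatible with Python 3.6.
    result = subprocess.run(
        ["dmesg"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout
```

Calling filter_gpu_lines(read_kernel_log()) on a hung device and printing the result gives a quick view of what the GPU driver reported around the time of the hang.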

I further discovered that the hang only occurs after my code hits CUDA error 702 during AI processing.

My guess is that after about two years the GPU’s performance has degraded, which causes error 702, and that error in turn leads to the hang.
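For reference, in the CUDA 10.2 runtime, error 702 is cudaErrorLaunchTimeout: a kernel ran longer than the allowed time, and the process cannot continue to use CUDA afterwards. Below is a minimal sketch of one way the application could react to it, assuming the inference code exposes the numeric CUDA status. The function names and the "retry" fallback for other codes are hypothetical placeholders, and whether exiting at the first 702 actually avoids the irq thread hang is untested.

```python
import sys

# CUDA runtime error codes (CUDA 10.1+ numbering).
CUDA_SUCCESS = 0
CUDA_ERROR_LAUNCH_TIMEOUT = 702  # cudaErrorLaunchTimeout

# Codes treated as unrecoverable inside the current process (assumed policy).
FATAL_CODES = {CUDA_ERROR_LAUNCH_TIMEOUT}


def classify_cuda_status(code):
    """Map a CUDA status code to an action: 'ok', 'restart' or 'retry'."""
    if code == CUDA_SUCCESS:
        return "ok"
    if code in FATAL_CODES:
        return "restart"
    # Placeholder policy for every other error code.
    return "retry"


def handle_inference_status(code, exit_fn=sys.exit):
    """Exit with a non-zero status on a fatal code so a supervisor can relaunch."""
    action = classify_cuda_status(code)
    if action == "restart":
        # Stop submitting GPU work from this process right away.
        exit_fn(1)
    return action
```

The idea is to call handle_inference_status after each inference call, with the process running under a supervisor that restarts it on a non-zero exit, so no further work is queued on a context that has already failed.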