Summary
Our system is built around Jetson AGX Orin for a 24/7 continuous-inference application, and long-duration stability is a hard requirement for the workload. We have been hitting a recurring GPU hang during sustained-compute testing that we have not been able to work around in software, and it is a blocker for our production launch. The failure is silent from the GPU driver side — no XID, no NVRM error, no MMU/SMMU fault. nvidia-smi and tegrastats hang indefinitely, our GPU-using containers go unresponsive and become unkillable (SIGKILL is ignored, so docker restart does not recover them), and the only recovery we have found is a full reboot, which is not viable for 24/7 operation.
We have captured five separate incidents across two distinct Orin systems on two different L4T builds within JetPack 6.2.1. Time to failure ranges from about an hour to a few days of sustained inference. The most consistent diagnostic observable is that the host1x IRQ counters - which are all affinity-pinned to CPU0 - stop advancing, while every CPU’s IPI1 (function-call interrupt) counter shows storm-level activity (10⁸-class counts across all cores).
We would really appreciate NVIDIA’s help identifying the root cause and a path to a fix. Our integration is built around JetPack 6.2.1, and a JetPack version change would mean months of re-integration work on the third-party system, so a fix we can apply on 6.2.1 would be ideal.
Systems
- Hardware:
  - System A (Dev Kit): NVIDIA Jetson AGX Orin Developer Kit, AGX Orin 64 GB module
  - System B (third-party carrier): embedded system built around the same AGX Orin 64 GB module (different carrier, different system integrator)
- JetPack: 6.2.1 on both systems
  - System A (Dev Kit): L4T 36.4.7-20250918154033 (Sep 18 2025)
  - System B (third-party carrier): L4T 36.4.4-20250616085344 (Jun 16 2025)
- We observe the same behavior on both L4T component builds (36.4.4 and 36.4.7) shipped within JetPack 6.2.1.
Software stack (inference containers)
- Container base: nvcr.io/nvidia/l4t-tensorrt:r10.3.0-devel (NGC L4T container, Ubuntu 22.04)
- TensorRT 10.7.0 (installed into the Docker image from the local Tegra apt repo nv-tensorrt-local-tegra-repo-ubuntu2204-10.7.0-cuda-12.6, replacing the image's bundled TensorRT 10.3 after an earlier memory issue we had discussed in forum thread 318948)
- CUDA 12.6
- Python 3.x inference process driving TensorRT directly (no DeepStream/NvDCF in the pure-TRT repro)
- DeepStream 7.1 with NvDCF tracker (DeepStream/NvDCF path only - Incidents 1, 2, 4)
Workload
Sustained GPU inference workload at a steady frame rate. Tested two paths:
- TensorRT inference only (pure Python + TensorRT, no GStreamer), OR
- TensorRT + DeepStream 7.1 with NvDCF tracker
Both paths reproduce on both systems, covering the full {dev kit, third-party embedded system} × {pure TensorRT, DeepStream/NvDCF} matrix. The common factor across all four corners is sustained GPU compute - this does not appear to be DeepStream/NvDCF-specific or carrier-specific.
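For context on what "sustained GPU compute" means here, below is a minimal sketch of the shape of the pure-TensorRT soak loop. The engine path, frame rate, and buffer handling are placeholders rather than our production values; the real application adds pre/post-processing around the same pattern.

```python
#!/usr/bin/env python3
# Minimal sketch of the pure-TensorRT soak loop (placeholder values, not our
# production code): deserialize an engine, bind device buffers once, then run
# execute_async_v3 at a steady paced rate until the hang occurs.
import time
import numpy as np
import tensorrt as trt
from cuda import cudart  # cuda-python

ENGINE_PATH = "model.engine"   # placeholder
TARGET_FPS = 30                # placeholder steady frame rate

logger = trt.Logger(trt.Logger.WARNING)
with open(ENGINE_PATH, "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
_, stream = cudart.cudaStreamCreate()

# Allocate one device buffer per I/O tensor and bind it to the execution context.
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    dtype = np.dtype(trt.nptype(engine.get_tensor_dtype(name)))
    shape = tuple(engine.get_tensor_shape(name))
    _, dptr = cudart.cudaMalloc(int(np.prod(shape)) * dtype.itemsize)
    context.set_tensor_address(name, int(dptr))

period = 1.0 / TARGET_FPS
while True:  # sustained compute, 24/7
    start = time.monotonic()
    context.execute_async_v3(stream_handle=stream)
    cudart.cudaStreamSynchronize(stream)
    time.sleep(max(0.0, period - (time.monotonic() - start)))
```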
Symptoms (consistent across 5 incidents)
- nvidia-smi blocks indefinitely, immune to Ctrl-C
- tegrastats blocks indefinitely, immune to Ctrl-C
- Docker containers using GPU become unkillable (SIGKILL ignored)
- Shell remains responsive; non-GPU operations continue
- No GPU driver errors logged: no XID, no NVRM, no MMU fault, no SMMU fault. In three of the five incidents (1, 2, 4) the only dmesg evidence of the wedge is a recurring "NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #10!!!" warning. The handler value is the kernel's hex bitmask of pending softirq vectors (#10 is bit 4, BLOCK_SOFTIRQ; a pending tasklet would appear as #40), i.e. non-RCU softirq work was still pending on a core whose tick had been stopped (see the decoder sketch after this list). In Incident 5 these warnings did not appear in the available dmesg buffer (likely overwritten before capture), so we cannot say they precede every incident.
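For reference, a tiny decoder for the handler value in that warning, using the softirq vector order from include/linux/interrupt.h (illustrative only):

```python
# Decode the hex "handler #NN" mask from the NOHZ tick-stop warning into the
# names of the pending softirq vectors (order from include/linux/interrupt.h).
SOFTIRQS = ["HI", "TIMER", "NET_TX", "NET_RX", "BLOCK",
            "IRQ_POLL", "TASKLET", "SCHED", "HRTIMER", "RCU"]

def decode(mask_hex: str) -> list[str]:
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(SOFTIRQS) if mask & (1 << bit)]

print(decode("10"))  # -> ['BLOCK']
print(decode("40"))  # -> ['TASKLET']
```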
Diagnostic signatures (captured live on hung system)
1. host1x IRQs frozen
All host1x IRQs are affinity-pinned to CPU0. Counts do not advance (verified with 5-second snapshots):
191: 736 [only CPU0] host1x_syncpt
192: 466312 [only CPU0] host1x_syncpt
193: 714488 [only CPU0] host1x_syncpt
199: 3361629 [only CPU0] host1x_general
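The 5-second snapshots were taken with a trivial check along these lines (illustrative helper, not our production tooling):

```python
#!/usr/bin/env python3
# Snapshot /proc/interrupts twice, 5 s apart, and report whether each host1x
# IRQ advanced. Illustrative helper only.
import time

def host1x_counts():
    with open("/proc/interrupts") as f:
        ncpus = len(f.readline().split())          # header row: CPU0 CPU1 ...
        counts = {}
        for line in f:
            if "host1x" in line:
                fields = line.split()
                irq = fields[0].rstrip(":")
                counts[irq] = sum(int(x) for x in fields[1:1 + ncpus])
        return counts

before = host1x_counts()
time.sleep(5)
after = host1x_counts()
for irq in sorted(after, key=int):
    delta = after[irq] - before.get(irq, 0)
    status = "FROZEN" if delta == 0 else f"+{delta}"
    print(f"IRQ {irq}: {after[irq]} ({status})")
```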
2. TASKLET softirq asymmetry on CPU0
Strongly concentrated on the host1x-pinned CPU across all incidents (CPU0 vs next-highest other core):
TASKLET CPU0: 85,968,661 (others: ~20,000) Incident 2
TASKLET CPU0: 184,303,479 (others: low) Incident 3
TASKLET CPU0: 50,288 (others: 1–2,000) Incident 4
TASKLET CPU0: 90,052 (others: 0–1,295) Incident 5
The CPU0 dominance ratio is consistent (10x to 10,000x the next-highest core), but the absolute magnitude varies by roughly four orders of magnitude across incidents; see point 3.
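The per-CPU TASKLET numbers above come from /proc/softirqs; a minimal version of the check (illustrative):

```python
#!/usr/bin/env python3
# Read the per-CPU TASKLET row from /proc/softirqs and report the CPU0 count
# against the next-highest core, i.e. the dominance ratio quoted above.
counts = []
with open("/proc/softirqs") as f:
    for line in f:
        if line.lstrip().startswith("TASKLET:"):
            counts = [int(x) for x in line.split()[1:]]
            break

cpu0, others = counts[0], counts[1:]
next_highest = max(others) if others else 0
print(f"TASKLET CPU0: {cpu0:,}  next-highest other core: {next_highest:,}")
if next_highest:
    print(f"CPU0 dominance: {cpu0 / next_highest:.0f}x")
```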
3. IPI1 (function-call interrupt) storm
The dominant observable, present in every incident, with per-CPU totals ranging from roughly 5×10⁷ to 4×10⁸:
IPI1: 53M–131M per CPU (Incident 2)
IPI1: 115M–214M per CPU (Incident 3)
IPI1: 66M–85M per CPU (Incident 4)
IPI1: 278M–400M per CPU (Incident 5)
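IPI1 is the "Function call interrupts" row in /proc/interrupts; the per-CPU totals above were read with a check like this (illustrative; the rate over a short window is what distinguishes a storm from ordinary cross-CPU call traffic):

```python
#!/usr/bin/env python3
# Sample the per-CPU IPI1 ("Function call interrupts") counters twice and print
# both the running totals and the rate over the sampling window.
import time

def ipi1_counts():
    with open("/proc/interrupts") as f:
        ncpus = len(f.readline().split())          # header row: CPU0 CPU1 ...
        for line in f:
            if line.lstrip().startswith("IPI1"):
                return [int(x) for x in line.split()[1:1 + ncpus]]
    return []

WINDOW = 10  # seconds
before = ipi1_counts()
time.sleep(WINDOW)
after = ipi1_counts()
for cpu, (b, a) in enumerate(zip(before, after)):
    print(f"CPU{cpu}: total {a:,}, {(a - b) / WINDOW:,.0f}/s over {WINDOW}s")
```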
4. Kworker caught mid-context-switch
A +pm kworker appears to get caught by the IPI storm: it is stuck in state R with wchan=0 and a corrupted return address on its stack. It lands on a different CPU each incident, which is why we interpret this as a downstream symptom rather than the cause:
| Incident | Workload | Kworker | Corrupt PC |
|---|---|---|---|
| 1 | DeepStream/NvDCF | kworker/9:0+pm | 0xec00000000000000 |
| 2 | DeepStream/NvDCF | kworker/0:1+pm | 0x1000 |
| 3 | TensorRT-only | kworker/8:1+pm | 0xb9b |
| 4 | DeepStream/NvDCF | kworker/1:1+pm | 0x3d00000000000000 |
| 5 | TensorRT-only | kworker/4:0+pm | 0xb9b |
The corrupt PC 0xb9b appeared in both TensorRT-only incidents (one per system), while the three DeepStream/NvDCF incidents each produced distinct values.
Example stack (Incident 4):
[<0>] __switch_to+0x104/0x160
[<0>] do_interrupt_handler+0x70/0x80
[<0>] do_interrupt_handler+0x70/0x80
[<0>] exit_el1_irq_or_nmi.isra.0+0x10/0x20
[<0>] el1_interrupt+0x48/0x80
[<0>] el1h_64_irq_handler+0x18/0x30
[<0>] el1h_64_irq+0x7c/0x80
[<0>] 0x3d00000000000000
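The kworker table and the stacks above were read out of /proc on the hung systems; a minimal sketch of that capture (illustrative helper; needs root for /proc/<pid>/stack):

```python
#!/usr/bin/env python3
# Dump state, wchan, and kernel stack for every "+pm" kworker, i.e. the
# information summarized in the table above. Run as root.
import glob

for comm_path in glob.glob("/proc/[0-9]*/comm"):
    pid_dir = comm_path.rsplit("/", 1)[0]
    try:
        with open(comm_path) as f:
            name = f.read().strip()
        if not (name.startswith("kworker") and name.endswith("+pm")):
            continue
        with open(f"{pid_dir}/stat") as f:
            state = f.read().split()[2]            # third field: process state
        with open(f"{pid_dir}/wchan") as f:
            wchan = f.read().strip() or "0"
        with open(f"{pid_dir}/stack") as f:
            stack = f.read()
        print(f"{name} ({pid_dir}): state={state} wchan={wchan}")
        print(stack)
    except OSError:
        continue  # process exited between listing and reading
```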
5. nvgpu kernel thread is idle
nvgpu_channel_p is in its normal nvgpu_worker_poll_work wait state (S) in Incidents 4 and 5, not wedged. ksoftirqds are all idle.
6. No SMMU fault
All arm-smmu fault IRQs (170/171/187/188/189) at 0.
What we have ruled out
- GDM3 / DCE RPC failures (bug reproduces on the dev kit with GDM3 disabled, and on the third-party embedded system with GDM3 active — GDM3 state is not a factor either way)
- DeepStream / NvDCF (reproduces with pure TensorRT inference on both systems — closes the workload × hardware matrix)
- nvgpu_release_firmware / nvgpu_string_validate firmware-management path (Incidents 4 and 5 had nvgpu_channel_p in its normal poll state with the same hang signature)
- CPU0-specific kworker bug (the kworker wedge is on a different CPU every incident: five distinct CPUs across five incidents)
- Carrier-level hardware defect (reproduces on the dev kit and on a third-party carrier built around the same Jetson AGX Orin 64 GB module, both workload paths). Note: both systems use the AGX Orin 64 GB module specifically, so a module-level hardware/firmware factor — or one specific to the 64 GB SKU — is not ruled out by this alone.
- IOMMU / SMMU page fault (fault counters at 0 across all incidents)
- Userspace application error (no driver-side error logged; failure is silent)
If there is a related tracker or known workaround we should be following, please point us to it. We can run instrumented kernels, capture kdumps, or ftrace host1x / nvgpu if that would help with triage; a sketch of the capture we could pre-arm before the next soak run is below. Thanks in advance for any guidance.
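For example, one capture we could pre-arm: enable the generic irq handler entry/exit trace events, filtered to the host1x IRQ numbers listed above, so the last host1x activity before the counters freeze is preserved in the ftrace ring buffer. Sketch only, assuming the standard tracefs layout; run as root.

```python
#!/usr/bin/env python3
# Arm ftrace to record irq_handler_entry/exit for the host1x IRQs before the
# next soak run; read /sys/kernel/tracing/trace after the hang.
TRACEFS = "/sys/kernel/tracing"
HOST1X_IRQS = (191, 192, 193, 199)  # from /proc/interrupts on our systems

def write(path, value):
    with open(f"{TRACEFS}/{path}", "w") as f:
        f.write(value)

irq_filter = " || ".join(f"irq == {n}" for n in HOST1X_IRQS)
write("events/irq/irq_handler_entry/filter", irq_filter)
write("events/irq/irq_handler_exit/filter", irq_filter)
write("events/irq/irq_handler_entry/enable", "1")
write("events/irq/irq_handler_exit/enable", "1")
write("buffer_size_kb", "8192")   # per-CPU ring buffer size
write("tracing_on", "1")
print(f"Tracing armed; read {TRACEFS}/trace after the next hang.")
```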