Jetson AGX Orin (JetPack 6.2.1): silent GPU hang - host1x interrupt servicing stalls under sustained compute, reproduces on two distinct Orin systems

Summary

Our system is built around the Jetson AGX Orin for a 24/7 continuous-inference application, so long-duration stability is a hard requirement. During sustained-compute testing we keep hitting a recurring GPU hang that we have not been able to work around in software, and it is a blocker for our production launch. The failure is silent from the GPU driver's side: no XID, no NVRM error, no MMU/SMMU fault. nvidia-smi and tegrastats hang indefinitely, our GPU-using containers go unresponsive and become unkillable (SIGKILL is ignored, so docker restart cannot recover them), and the only recovery we have found is a full reboot, which is not viable for 24/7 operation.

We have captured five separate incidents across two distinct Orin systems on two different L4T builds within JetPack 6.2.1. Time to failure ranges from about an hour to a few days of sustained inference. The most consistent diagnostic observable is that the host1x IRQ counters (all affinity-pinned to CPU0) stop advancing, while every CPU's IPI1 (function-call interrupt) counter shows storm-level activity (counts in the 10⁷ to 10⁸ range on every core).

We would really appreciate NVIDIA’s help identifying the root cause and a path to a fix. Our integration is built around JetPack 6.2.1, and a JetPack version change would mean months of re-integration work on the third-party system, so a fix we can apply on 6.2.1 would be ideal.

Systems

  • Hardware:
    • System A (Dev Kit): NVIDIA Jetson AGX Orin Developer Kit, AGX Orin 64 GB module
    • System B (third-party carrier): embedded system built around the same AGX Orin 64 GB module (different carrier, different system integrator)
  • JetPack: 6.2.1 on both systems
    • System A (Dev Kit): L4T 36.4.7-20250918154033 (Sep 18 2025)
    • System B (third-party carrier): L4T 36.4.4-20250616085344 (Jun 16 2025)
  • We observe the same behavior on both L4T component builds (36.4.4 and 36.4.7) shipped within JetPack 6.2.1.

Software stack (inference containers)

  • Container base: nvcr.io/nvidia/l4t-tensorrt:r10.3.0-devel (NGC L4T container, Ubuntu 22.04)
  • TensorRT 10.7.0 (installed into the Docker image from the local Tegra apt repo nv-tensorrt-local-tegra-repo-ubuntu2204-10.7.0-cuda-12.6, replacing the image’s bundled TRT 10.3 after an earlier memory issue we had discussed in forum thread 318948)
  • CUDA 12.6
  • Python 3.x inference process driving TensorRT directly (no DeepStream/NvDCF in the pure-TRT repro)
  • DeepStream 7.1 with NvDCF tracker (DeepStream/NvDCF path only - Incidents 1, 2, 4)

Workload

Sustained GPU inference workload at a steady frame rate. Tested two paths:

  • TensorRT inference only (pure Python + TensorRT, no GStreamer), OR
  • TensorRT + DeepStream 7.1 with NvDCF tracker

Both paths reproduce on both systems, covering the full {dev kit, third-party embedded system} × {pure TensorRT, DeepStream/NvDCF} matrix. The common factor across all four corners is sustained GPU compute; the hang does not appear to be DeepStream/NvDCF-specific or carrier-specific.
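For reference, the pure-TensorRT path boils down to the loop below. This is a minimal sketch, not our production code: the engine filename "model.plan", the 30 FPS target, the static-shape assumption, and the use of the cuda-python bindings are all illustrative; input/output copies are omitted since only sustained GPU compute is needed to reproduce.

import time
import numpy as np
import tensorrt as trt
from cuda import cudart  # cuda-python bindings (illustrative choice)

TARGET_FPS = 30  # steady frame rate; the real rate is workload-specific

logger = trt.Logger(trt.Logger.WARNING)
with open("model.plan", "rb") as f:  # hypothetical engine file
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate a device buffer for every I/O tensor and register its address
# (assumes static shapes, so trt.volume() is well defined).
_, stream = cudart.cudaStreamCreate()
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    dtype = np.dtype(trt.nptype(engine.get_tensor_dtype(name)))
    nbytes = trt.volume(engine.get_tensor_shape(name)) * dtype.itemsize
    _, dptr = cudart.cudaMalloc(nbytes)
    context.set_tensor_address(name, int(dptr))

# Sustained, steady-rate inference; this loop runs for hours to days.
period = 1.0 / TARGET_FPS
while True:
    t0 = time.monotonic()
    context.execute_async_v3(int(stream))
    cudart.cudaStreamSynchronize(stream)
    time.sleep(max(0.0, period - (time.monotonic() - t0)))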

Symptoms (consistent across 5 incidents)

  • nvidia-smi blocks indefinitely, immune to Ctrl-C

  • tegrastats blocks indefinitely, immune to Ctrl-C

  • Docker containers using GPU become unkillable (SIGKILL ignored)

  • Shell remains responsive; non-GPU operations continue

  • No GPU driver errors logged: no XID, no NVRM, no MMU fault, no SMMU fault. In three of the five incidents (1, 2, and 4) the only dmesg evidence of the wedge is this recurring warning:

    NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #10!!!
    

    The value after "handler #" is the pending-softirq bitmask, printed in hex from local_softirq_pending(); in the 5.15 softirq numbering, 0x10 is bit 4, i.e. BLOCK_SOFTIRQ rather than TASKLET_SOFTIRQ, but either way it means softirq work was pending on a core whose tick had been stopped. In Incident 5 these warnings did not appear in the available dmesg buffer (likely overwritten before capture), so we cannot say they precede every incident.
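    To decode the mask in future incidents we use a small helper; the helper name is ours (hypothetical), and the softirq order is taken from the 5.15 kernel's include/linux/interrupt.h:

    SOFTIRQS = ["HI", "TIMER", "NET_TX", "NET_RX", "BLOCK",
                "IRQ_POLL", "TASKLET", "SCHED", "HRTIMER", "RCU"]

    def decode_softirq_mask(mask: int) -> list[str]:
        # Names of every softirq bit set in the pending mask,
        # e.g. decode_softirq_mask(0x40) == ["TASKLET"].
        return [name for bit, name in enumerate(SOFTIRQS) if mask & (1 << bit)]

    print(decode_softirq_mask(0x10))  # the mask from our warnings -> ['BLOCK']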

Diagnostic signatures (captured live on hung system)

1. host1x IRQs frozen

All host1x IRQs are affinity-pinned to CPU0. Counts do not advance (verified with 5-second snapshots):

191:        736  [only CPU0]  host1x_syncpt
192:     466312  [only CPU0]  host1x_syncpt
193:     714488  [only CPU0]  host1x_syncpt
199:    3361629  [only CPU0]  host1x_general
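A minimal sketch of that snapshot check, assuming the standard /proc/interrupts layout (one column per CPU immediately after the IRQ number) and matching lines by the "host1x" name:

import os
import time

NCPU = os.cpu_count()

def host1x_counts(match="host1x"):
    # Sum the per-CPU columns for each IRQ line whose name matches.
    counts = {}
    with open("/proc/interrupts") as f:
        for line in f:
            if match in line:
                parts = line.split()
                counts[parts[0].rstrip(":")] = sum(int(p) for p in parts[1:1 + NCPU])
    return counts

before = host1x_counts()
time.sleep(5)
after = host1x_counts()
for irq, start in before.items():
    delta = after.get(irq, start) - start
    print(f"IRQ {irq}: +{delta} in 5 s{'  <-- frozen' if delta == 0 else ''}")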

2. TASKLET softirq asymmetry on CPU0

Strongly concentrated on the host1x-pinned CPU across all incidents (CPU0 vs next-highest other core):

     TASKLET CPU0:    85,968,661    (others: ~20,000)         Incident 2
     TASKLET CPU0:   184,303,479    (others: low)             Incident 3
     TASKLET CPU0:        50,288    (others: 1–2,000)         Incident 4
     TASKLET CPU0:        90,052    (others: 0–1,295)         Incident 5

The CPU0 dominance ratio is consistent (10× to 10,000× the next-highest core), but the absolute magnitude varies by roughly four orders of magnitude across incidents; see point 3.
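The counters above come from /proc/softirqs. A short sketch of the dominance check, assuming the standard layout (a TASKLET row with one column per CPU):

def tasklet_counts():
    with open("/proc/softirqs") as f:
        for line in f:
            if line.strip().startswith("TASKLET:"):
                return [int(x) for x in line.split(":")[1].split()]
    return []

counts = tasklet_counts()
cpu0, others = counts[0], counts[1:]
ratio = cpu0 / max(max(others), 1)  # avoid division by zero on idle cores
print(f"TASKLET CPU0={cpu0:,}  next-highest={max(others):,}  ratio={ratio:,.0f}x")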

3. IPI1 (function-call interrupt) storm

The dominant observable, present in every incident, with counts ranging from ~5×10⁷ to ~4×10⁸ per CPU:

IPI1:    53M–131M per CPU    (Incident 2)
IPI1:   115M–214M per CPU    (Incident 3)
IPI1:    66M–85M  per CPU    (Incident 4)
IPI1:   278M–400M per CPU    (Incident 5)
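Because the absolute counts are so large, per-second rates are more useful than raw counters for telling an ongoing storm from a counter that is merely high. A sketch, assuming the arm64 /proc/interrupts layout in which IPI1 is the function-call row:

import os
import time

NCPU = os.cpu_count()

def ipi1_counts():
    with open("/proc/interrupts") as f:
        for line in f:
            if line.strip().startswith("IPI1:"):
                return [int(x) for x in line.split()[1:1 + NCPU]]
    return []

a = ipi1_counts()
time.sleep(5)
b = ipi1_counts()
print("IPI1 per-CPU rates (per second):",
      [f"{(y - x) / 5:,.0f}" for x, y in zip(a, b)])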

4. Kworker caught mid-context-switch

A +pm kworker appears to get caught by the IPI storm and is stuck in state R with wchan=0 and a corrupted return address on its stack. It lands on a different CPU in each incident, which is why we interpret this as a downstream symptom rather than the cause:

Incident  Workload          Kworker          Corrupt PC
1         DeepStream/NvDCF  kworker/9:0+pm   0xec00000000000000
2         DeepStream/NvDCF  kworker/0:1+pm   0x1000
3         TensorRT-only     kworker/8:1+pm   0xb9b
4         DeepStream/NvDCF  kworker/1:1+pm   0x3d00000000000000
5         TensorRT-only     kworker/4:0+pm   0xb9b

The corrupt PC 0xb9b appeared in both TensorRT-only incidents (one per system), while the three DeepStream/NvDCF incidents each produced distinct values.

Example stack (Incident 4):

[<0>] __switch_to+0x104/0x160
[<0>] do_interrupt_handler+0x70/0x80
[<0>] do_interrupt_handler+0x70/0x80
[<0>] exit_el1_irq_or_nmi.isra.0+0x10/0x20
[<0>] el1_interrupt+0x48/0x80
[<0>] el1h_64_irq_handler+0x18/0x30
[<0>] el1h_64_irq+0x7c/0x80
[<0>] 0x3d00000000000000
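The kworker states and stacks above were captured with a scan along these lines (a sketch; it must run as root, since /proc/<pid>/stack is privileged):

import os

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/comm") as f:
            comm = f.read().strip()
        if not comm.startswith("kworker"):
            continue
        with open(f"/proc/{pid}/stat") as f:
            state = f.read().rsplit(") ", 1)[1].split()[0]  # field after "(comm)"
        with open(f"/proc/{pid}/wchan") as f:
            wchan = f.read().strip()
        if state == "R" and wchan in ("0", ""):  # running, no wait channel
            print(f"{comm} (pid {pid}) state={state} wchan={wchan}")
            with open(f"/proc/{pid}/stack") as f:
                print(f.read())
    except (OSError, IndexError):
        continue  # task exited mid-scan or insufficient privilege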

5. nvgpu kernel thread is idle

In Incidents 4 and 5, nvgpu_channel_p was in its normal nvgpu_worker_poll_work wait state (S), not wedged, and the ksoftirqd threads were all idle.

6. No SMMU fault

All arm-smmu fault IRQ counters (170/171/187/188/189) are at 0.

What we have ruled out

  • GDM3 / DCE RPC failures (bug reproduces on the dev kit with GDM3 disabled, and on the third-party embedded system with GDM3 active — GDM3 state is not a factor either way)
  • DeepStream / NvDCF (reproduces with pure TensorRT inference on both systems — closes the workload × hardware matrix)
  • nvgpu_release_firmware / nvgpu_string_validate firmware-management path (Incidents 4 and 5 had nvgpu_channel_p in normal poll state with the same hang signature)
  • CPU0-specific kworker bug (kworker wedge is on a different CPU every incident — five distinct CPUs across five incidents)
  • Carrier-level hardware defect (reproduces on the dev kit and on a third-party carrier built around the same Jetson AGX Orin 64 GB module, both workload paths). Note: both systems use the AGX Orin 64 GB module specifically, so a module-level hardware/firmware factor — or one specific to the 64 GB SKU — is not ruled out by this alone.
  • IOMMU / SMMU page fault (fault counters at 0 across all incidents)
  • Userspace application error (no driver-side error logged; failure is silent)

If there is a related tracker or known workaround we should be following, please point us to it. We can run instrumented kernels, capture kdumps, or ftrace host1x / nvgpu if that would help with triage. Thanks in advance for any guidance.


*** Please note that this reply is generated by LLM automatically ***

The issue described is a recurring GPU hang during sustained compute testing on Jetson AGX Orin systems, which is a blocker for production launch. The failure is silent from the GPU driver side, with no XID, NVRM error, or MMU/SMMU fault, and the only recovery is a full reboot.

The diagnostic signatures captured live on the hung system show that all host1x IRQs are affinity-pinned to CPU0 and that their counts do not advance. The CPU0 dominance ratio of TASKLET softirqs is consistent, but the absolute magnitude varies by four orders of magnitude across incidents. A kworker appears to get caught by the IPI storm and is stuck in state R with a corrupted return address on its stack.

The issue has been observed on two distinct Orin systems, with different L4T builds within JetPack 6.2.1, and with different workloads (TensorRT inference only and TensorRT with DeepStream and NvDCF tracker). The common factor across all incidents is sustained GPU compute.

The symptoms are consistent across 5 incidents:

  • nvidia-smi and tegrastats block indefinitely, immune to Ctrl-C
  • Docker containers using GPU become unkillable
  • Shell remains responsive, non-GPU operations continue
  • No XID, NVRM, MMU, or SMMU fault

In three of the five incidents, the only dmesg evidence of the wedge is recurring NOHZ tick-stop warnings, indicating softirq work pending on a core whose tick has been stopped.

The following causes have been ruled out:

  • GDM3 or DCE RPC failures (reproduces with GDM3 disabled and on a third-party embedded system with GDM3 active)
  • DeepStream or NvDCF specific (reproduces with pure TensorRT inference on both systems)
  • Carrier-level hardware defect (reproduces on the dev kit and on a third-party carrier built around the same Jetson AGX Orin 64 GB module)
  • Userspace application error (no driver-side error logged, failure is silent)

The root cause of the issue is still unknown, and a fix that can be applied on JetPack 6.2.1 would be ideal. The user is willing to run instrumented kernels, capture kdumps, or run ftrace on host1x/nvgpu if that would help with triage.

Comparison with other issues:

  • The issue is similar to a known issue with Jetson AGX Orin, where the GPU can hang indefinitely due to tasklet softirq work pending on a core whose tick has been stopped. However, the root cause of this issue is still unknown.
  • The issue is different from a known issue with DeepStream and NvDCF, where the tracker can get stuck in an infinite loop. In this case, the issue reproduces with pure TensorRT inference, and the tracker is not involved.

Recommendations:

  • NVIDIA should investigate the root cause of the issue and provide a fix.
  • The user should try to capture more diagnostic information, such as kdumps or ftrace, to help with triage.
  • The user should consider upgrading to a newer version of JetPack, if possible, to see if the issue is fixed.
  • The user should consider using a different GPU or system configuration to see if the issue is specific to the Jetson AGX Orin.

*** Please give the thumbs up if you get answers you like. Or provide feedback to help us improve the answer. ***

Hi

We recently fixed a host1x bug.
Could you try applying the patch below to see if it helps?

Thanks.