[Jetson Thor] Kernel Panic - Memory Controller EMEM Decode Errors + Poison Bit

ISSUE SUMMARY

Experiencing reproducible kernel panics on Jetson Thor (R38.2.2) with memory controller errors.
System has crashed twice in 24 hours with identical symptoms.

CRITICAL ERROR PATTERN:
tegra-mc 8108020000.memory-controller: dispr: non-secure read @0x0000fffffffff400: EMEM address decode error
tegra-mc 8108020000.memory-controller: ptcr: @0x0000000000000000: Read response with poison bit error status:0
arm-smmu-v3 8806000000.iommu: EVTQ overflow detected -- events lost

Followed 7 minutes later by:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000019
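
For anyone seeing similar messages, the lines above can be pulled from the boot that panicked with standard journald tooling, e.g.:

# Kernel messages from the previous boot, filtered to the relevant drivers
sudo journalctl -k -b -1 | grep -E 'tegra-mc|arm-smmu-v3|Unable to handle'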

SYSTEM CONFIGURATION

Platform: Jetson Thor (Blackwell GPU, Part# 2B00-A1)
Jetpack: R38.2.2 (GCID: 42205042)
Kernel: 6.8.12-tegra #1 SMP PREEMPT
Driver: 580.00
CUDA: 13.0
OS: Ubuntu 24.04
RAM: 122 GB
Workload: Docker containers with GPU acceleration (low utilization at crash)
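
A quick sanity check to confirm the L4T release and that the NVIDIA container runtime is configured (standard commands, nothing specific to this report):

cat /etc/nv_tegra_release
sudo docker info | grep -iA2 'runtime'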

SYMPTOMS

  1. Reproducible: 2 crashes in 24 hours
  2. Persistent: Memory errors continue after reboot
  3. Pattern: EMEM errors at addresses near 0xfffffffff*
  4. Poison Bit: Data corruption indicated
  5. IOMMU Overflow: 32K+ events before crash

TIMELINE

Oct 14, 14:57:44 - First crash with identical symptoms
Oct 15, 13:16:13 - Memory controller errors begin
Oct 15, 13:23:06 - Kernel panic (NULL pointer dereference)
Oct 15, 13:29:23 - System reboots
Oct 15, 13:46:46 - Memory errors resume post-reboot
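
A timeline like this can be reconstructed across reboots with journald, roughly:

# Enumerate boots, then dump a given boot's kernel log with precise timestamps
journalctl --list-boots
sudo journalctl -k -b -1 -o short-precise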

QUESTIONS FOR NVIDIA

  1. Is this a known issue with R38.2.2 / Driver 580.00?
  2. What causes poison bit errors on Thor SoC? Hardware or software?
  3. Why is EMEM decode failing at high addresses (0xfffffffff*)?
  4. Are diagnostic tools available for the Thor memory controller?

CONTEXT

No memory pressure at crash - 70 GB free
No thermal issues - 42-45°C
No GPU overload - 0% utilization
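
These figures can be captured around a repro attempt with JetPack's tegrastats, e.g.:

# Log RAM, temperature, and GPU utilization once per second to a file
sudo tegrastats --interval 1000 --logfile /tmp/tegrastats.log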

Full technical report with complete logs available upon request.

Hi,
Please share the steps/commands to replicate it on a developer kit. We will set up a developer kit and check.


Reproduction_Report.txt (23.6 KB)

Hello NVIDIA Jetson Team,

I’m experiencing a reproducible kernel panic on Jetson Thor (R38.2.2) when connecting via RDP while multiple GPU-enabled Docker containers are running. The system crashes within 2-7 minutes with memory controller EMEM decode errors, poison bit corruption, and IOMMU overflow.

CRITICAL FINDING:
The crash is triggered when gnome-remote-desktop initializes CUDA for hardware-accelerated H.264 RDP encoding while 5 other GPU/CUDA contexts are active (4 Docker containers + Xorg). The IOMMU event queue (32,768 entries) overflows with translation faults during the CUDA context creation phase, leading to memory controller failure and kernel panic.
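
The overflow sequence can be watched live while connecting over RDP, e.g.:

# Stream kernel messages, filtered to the SMMU and memory-controller drivers
sudo dmesg --follow | grep -E 'arm-smmu-v3|tegra-mc'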

REPRODUCTION RATE: 2/2 attempts (100%)

SYSTEM:

  • Hardware: Jetson Thor (Blackwell), Part# 2B00-A1
  • JetPack: R38.2.2 (GCID: 42205042)
  • Kernel: 6.8.12-tegra
  • Driver: 580.00 / CUDA 13.0

TRIGGER:

  1. Start 4 GPU Docker containers (using public images: ollama, pytorch; example commands below)
  2. Connect via RDP
  3. gnome-remote-desktop initializes CUDA
  4. IOMMU overflow → Memory controller errors → Kernel panic
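
A minimal sketch of step 1 on a clean devkit (image names and tags are examples only; <tag> is a placeholder for a current release, and any 4 GPU-enabled containers should do):

sudo docker run -d --name ollama1 --runtime nvidia ollama/ollama
sudo docker run -d --name ollama2 --runtime nvidia ollama/ollama
sudo docker run -d --name torch1 --runtime nvidia nvcr.io/nvidia/pytorch:<tag> \
  python3 -c "import torch, time; torch.cuda.init(); time.sleep(3600)"
sudo docker run -d --name torch2 --runtime nvidia nvcr.io/nvidia/pytorch:<tag> \
  python3 -c "import torch, time; torch.cuda.init(); time.sleep(3600)"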

WORKAROUND (CONFIRMED):
Disabling CUDA in gnome-remote-desktop prevents the crash:
gsettings set org.gnome.desktop.remote-desktop.rdp enable-hw-h264 false
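
To verify the key took effect, or to restore hardware encoding later (the service name may vary by setup):

gsettings get org.gnome.desktop.remote-desktop.rdp enable-hw-h264
gsettings set org.gnome.desktop.remote-desktop.rdp enable-hw-h264 true
systemctl --user restart gnome-remote-desktop.service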

I’ve attached a comprehensive reproduction report with:

  • Complete step-by-step reproduction using public Docker images
  • Exact timestamps and logs from both crash occurrences
  • CUDA context analysis showing the 6-context saturation point
  • Detailed IOMMU overflow sequence
  • System configuration and diagnostics

QUESTIONS FOR NVIDIA:

  1. Are 6 concurrent CUDA contexts within design limits for the Thor SoC?
  2. Is the 32,768-entry IOMMU event queue adequate for multiple GPU contexts?
  3. Should CUDA context creation be serialized to prevent IOMMU storms?
  4. Why does the memory controller report poison bit errors at addresses near 0xfffffffff*?

Thank you for taking the time to investigate this with your dev kit; any guidance would be greatly appreciated. I'm happy to provide additional diagnostics or test patches.

The attached report contains all technical details, log excerpts, and reproduction steps needed to replicate the issue.

Best regards

Daniel Hill
-N|N-Labs

There has been no update from you for a while, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks

Hi,
Would you please share the one-by-one steps, so that we can simply follow them to quickly set up a developer kit and reproduce the issue?