Experiencing reproducible kernel panics on Jetson Thor (R38.2.2) with memory controller errors.
System has crashed twice in 24 hours with identical symptoms.
Pattern: EMEM errors at addresses near 0xfffffffff*
Poison Bit: Data corruption indicated
IOMMU Overflow: 32K+ events before crash
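For anyone seeing a similar pattern, this is roughly how I pulled these errors out of the logs; the exact message strings vary between JetPack releases, so treat the grep patterns as a starting point rather than exact matches:

sudo dmesg --level=err,warn | grep -iE 'emem|mc-err|poison|smmu|iommu'   # current boot
sudo journalctl -k -b -1 | grep -iE 'emem|mc-err|poison|smmu|iommu'      # kernel log from the boot that panicked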
TIMELINE
Oct 14, 14:57:44 - First crash with identical symptoms
Oct 15, 13:16:13 - Memory controller errors begin
Oct 15, 13:23:06 - Kernel panic (NULL pointer dereference)
Oct 15, 13:29:23 - System reboots
Oct 15, 13:46:46 - Memory errors resume post-reboot
QUESTIONS FOR NVIDIA
Is this a known issue with R38.2.2 / Driver 580.00?
What causes poison bit errors on the Thor SoC? Are they hardware or software in origin?
Why is EMEM decode failing at high addresses (0xfffffffff*)?
Are there diagnostic tools available for the Thor memory controller?
CONTEXT
No memory pressure at crash - 70 GB free
No thermal issues - 42-45 °C
No GPU overload - 0% utilization
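The figures above were captured with standard tools, roughly as follows (tool availability can vary by JetPack image):

free -h                                      # system memory headroom (~70 GB free here)
cat /sys/class/thermal/thermal_zone*/temp    # SoC thermal zones, in millidegrees C
sudo tegrastats --interval 1000              # RAM, temperature, and GPU utilization in one readout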
Full technical report with complete logs available upon request.
I’m experiencing a reproducible kernel panic on Jetson Thor (R38.2.2) when connecting via RDP while multiple GPU-enabled Docker containers are running. The system crashes within 2-7 minutes with memory controller EMEM decode errors, poison bit corruption, and IOMMU overflow.
CRITICAL FINDING:
The crash is triggered when gnome-remote-desktop initializes CUDA for hardware-accelerated H.264 RDP encoding while 5 other GPU/CUDA contexts are active (4 Docker containers + Xorg). The IOMMU event queue (32,768 entries) overflows with translation faults during the CUDA context creation phase, leading to memory controller failure and kernel panic.
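A rough way to confirm the context count at the moment the RDP session connects (this assumes nvidia-smi and the /dev/nvidia* device nodes are exposed on Thor as they are on my setup):

sudo fuser -v /dev/nvidia* 2>/dev/null                                      # processes holding the GPU open (containers + Xorg + gnome-remote-desktop)
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv   # active compute contexts
sudo journalctl -k -f | grep -iE 'smmu|iommu|mc-err'                        # watch the translation-fault flood live while connecting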
REPRODUCTION RATE: 2/2 attempts (100%)
SYSTEM:
Hardware: Jetson Thor (Blackwell), Part# 2B00-A1
JetPack: R38.2.2 (GCID: 42205042)
Kernel: 6.8.12-tegra
Driver: 580.00 / CUDA 13.0
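(For reference, the version strings above were collected roughly as follows; the paths are the stock JetPack ones and may differ on a customized image:)

cat /etc/nv_tegra_release                                      # L4T release and GCID
uname -r                                                       # kernel version
nvidia-smi --query-gpu=driver_version --format=csv,noheader    # GPU driver version
nvcc --version                                                 # CUDA toolkit version (if the toolkit is installed)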
TRIGGER:
1. Start 4 GPU Docker containers (using public images: ollama, pytorch) - example commands below
2. Connect to the desktop via RDP while the containers are running (gnome-remote-desktop initializes CUDA for hardware-accelerated H.264 encoding)
3. Kernel panic follows within 2-7 minutes
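A minimal sketch of the container startup - the image names/tags below are placeholders (the attached report lists the exact images and versions), and I use --runtime nvidia, though --gpus all should be equivalent if the NVIDIA container runtime is configured:

PYTORCH_IMAGE=pytorch/pytorch   # placeholder; substitute the arm64 PyTorch image from the report
sudo docker run -d --runtime nvidia --name ollama1 ollama/ollama
sudo docker run -d --runtime nvidia --name ollama2 ollama/ollama
sudo docker run -d --runtime nvidia --name torch1 "$PYTORCH_IMAGE" sleep infinity
sudo docker run -d --runtime nvidia --name torch2 "$PYTORCH_IMAGE" sleep infinity
# Then connect to the desktop over RDP; the panic followed within 2-7 minutes here.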
WORKAROUND (CONFIRMED):
Disabling CUDA in gnome-remote-desktop prevents the crash:
gsettings set org.gnome.desktop.remote-desktop.rdp enable-hw-h264 false
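To verify and apply the change (key name as used above; this assumes gnome-remote-desktop runs as a per-user systemd service, as on stock Ubuntu):

gsettings get org.gnome.desktop.remote-desktop.rdp enable-hw-h264   # should now print 'false'
systemctl --user restart gnome-remote-desktop.service               # restart so new RDP sessions pick up the setting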
I’ve attached a comprehensive reproduction report with:
Complete step-by-step reproduction using public Docker images
Exact timestamps and logs from both crash occurrences
CUDA context analysis showing the 6-context saturation point
Detailed IOMMU overflow sequence
System configuration and diagnostics
QUESTIONS FOR NVIDIA:
Is 6 concurrent CUDA contexts within design limits for Thor SoC?
Is the 32,768-entry IOMMU event queue adequate for multiple GPU contexts?
Should CUDA context creation be serialized to prevent IOMMU storms?
Why does the memory controller report poison bit errors at addresses near 0xfffffffff*?
Thank you for taking the time to investigate this with your dev kit; any guidance would be greatly appreciated. I'm happy to provide additional diagnostics or test patches.
The attached report contains all technical details, log excerpts, and reproduction steps needed to replicate the issue.
There has been no update from you for a while, so we are assuming this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks.
Hi,
Would you please share the one-by-one steps, so that we can simply follow them to quickly set up the developer kit and reproduce the issue?