[Jetson Thor] Kernel Panic - Memory Controller EMEM Decode Errors + Poison Bit

ISSUE SUMMARY

Experiencing reproducible kernel panics on Jetson Thor (R38.2.2) with memory controller errors.
System has crashed twice in 24 hours with identical symptoms.

CRITICAL ERROR PATTERN:
tegra-mc 8108020000.memory-controller: dispr: non-secure read @0x0000fffffffff400: EMEM address decode error
tegra-mc 8108020000.memory-controller: ptcr: @0x0000000000000000: Read response with poison bit error status:0
arm-smmu-v3 8806000000.iommu: EVTQ overflow detected -- events lost

Followed 7 minutes later by:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000019
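
For anyone seeing similar messages, the lines above can be pulled from the boot that panicked with standard journald tooling, e.g.:

# Kernel messages from the previous boot, filtered to the relevant drivers
sudo journalctl -k -b -1 | grep -E 'tegra-mc|arm-smmu-v3|Unable to handle'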

SYSTEM CONFIGURATION

Platform: Jetson Thor (Blackwell GPU, Part# 2B00-A1)
Jetpack: R38.2.2 (GCID: 42205042)
Kernel: 6.8.12-tegra #1 SMP PREEMPT
Driver: 580.00
CUDA: 13.0
OS: Ubuntu 24.04
RAM: 122 GB
Workload: Docker containers with GPU acceleration (low utilization at crash)
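
A quick sanity check to confirm the L4T release and that the NVIDIA container runtime is configured (standard commands, nothing specific to this report):

cat /etc/nv_tegra_release
sudo docker info | grep -iA2 'runtime'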

SYMPTOMS

  1. Reproducible: 2 crashes in 24 hours
  2. Persistent: Memory errors continue after reboot
  3. Pattern: EMEM errors at addresses near 0xfffffffff*
  4. Poison Bit: Data corruption indicated
  5. IOMMU Overflow: 32K+ events before crash

TIMELINE

Oct 14, 14:57:44 - First crash with identical symptoms
Oct 15, 13:16:13 - Memory controller errors begin
Oct 15, 13:23:06 - Kernel panic (NULL pointer dereference)
Oct 15, 13:29:23 - System reboots
Oct 15, 13:46:46 - Memory errors resume post-reboot
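
A timeline like this can be reconstructed across reboots with journald, roughly:

# Enumerate boots, then dump a given boot's kernel log with precise timestamps
journalctl --list-boots
sudo journalctl -k -b -1 -o short-precise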

QUESTIONS FOR NVIDIA

  1. Is this a known issue with R38.2.2 / Driver 580.00?
  2. What causes poison bit errors on Thor SoC? Hardware or software?
  3. Why is EMEM decode failing at high addresses (0xfffffffff*)?
  4. Are diagnostic tools available for the Thor memory controller?

CONTEXT

No memory pressure at crash - 70 GB free
No thermal issues - 42-45°C
No GPU overload - 0% utilization
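
These figures can be captured around a repro attempt with JetPack's tegrastats, e.g.:

# Log RAM, temperature, and GPU utilization once per second to a file
sudo tegrastats --interval 1000 --logfile /tmp/tegrastats.log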

Full technical report with complete logs available upon request.

Hi,
Please share the steps/commands to replicate it on a developer kit. We will set up a developer kit and check.


Reproduction_Report.txt (23.6 KB)

Hello NVIDIA Jetson Team,

I’m experiencing a reproducible kernel panic on Jetson Thor (R38.2.2) when connecting via RDP while multiple GPU-enabled Docker containers are running. The system crashes within 2-7 minutes with memory controller EMEM decode errors, poison bit corruption, and IOMMU overflow.

CRITICAL FINDING:
The crash is triggered when gnome-remote-desktop initializes CUDA for hardware-accelerated H.264 RDP encoding while 5 other GPU/CUDA contexts are active (4 Docker containers + Xorg). The IOMMU event queue (32,768 entries) overflows with translation faults during the CUDA context creation phase, leading to memory controller failure and kernel panic.
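
The overflow sequence can be watched live while connecting over RDP, e.g.:

# Stream kernel messages, filtered to the SMMU and memory-controller drivers
sudo dmesg --follow | grep -E 'arm-smmu-v3|tegra-mc'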

REPRODUCTION RATE: 2/2 attempts (100%)

SYSTEM:

  • Hardware: Jetson Thor (Blackwell), Part# 2B00-A1
  • JetPack: R38.2.2 (GCID: 42205042)
  • Kernel: 6.8.12-tegra
  • Driver: 580.00 / CUDA 13.0

TRIGGER:

  1. Start 4 GPU Docker containers (using public images: ollama, pytorch; example commands below)
  2. Connect via RDP
  3. gnome-remote-desktop initializes CUDA
  4. IOMMU overflow → Memory controller errors → Kernel panic
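
A minimal sketch of step 1 on a clean devkit (image names and tags are examples only; <tag> is a placeholder for a current release, and any 4 GPU-enabled containers should do):

sudo docker run -d --name ollama1 --runtime nvidia ollama/ollama
sudo docker run -d --name ollama2 --runtime nvidia ollama/ollama
sudo docker run -d --name torch1 --runtime nvidia nvcr.io/nvidia/pytorch:<tag> \
  python3 -c "import torch, time; torch.cuda.init(); time.sleep(3600)"
sudo docker run -d --name torch2 --runtime nvidia nvcr.io/nvidia/pytorch:<tag> \
  python3 -c "import torch, time; torch.cuda.init(); time.sleep(3600)"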

WORKAROUND (CONFIRMED):
Disabling CUDA in gnome-remote-desktop prevents the crash:
gsettings set org.gnome.desktop.remote-desktop.rdp enable-hw-h264 false
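
To verify the key took effect, or to restore hardware encoding later (the service name may vary by setup):

gsettings get org.gnome.desktop.remote-desktop.rdp enable-hw-h264
gsettings set org.gnome.desktop.remote-desktop.rdp enable-hw-h264 true
systemctl --user restart gnome-remote-desktop.service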

I’ve attached a comprehensive reproduction report with:

  • Complete step-by-step reproduction using public Docker images
  • Exact timestamps and logs from both crash occurrences
  • CUDA context analysis showing the 6-context saturation point
  • Detailed IOMMU overflow sequence
  • System configuration and diagnostics

QUESTIONS FOR NVIDIA:

  1. Are 6 concurrent CUDA contexts within design limits for the Thor SoC?
  2. Is the 32,768-entry IOMMU event queue adequate for multiple GPU contexts?
  3. Should CUDA context creation be serialized to prevent IOMMU storms?
  4. Why does the memory controller report poison bit errors at addresses near 0xfffffffff*?

Thank you for taking the time to investigate this with your dev kit; any guidance would be greatly appreciated. I'm happy to provide additional diagnostics or test patches.

The attached report contains all technical details, log excerpts, and reproduction steps needed to replicate the issue.

Best regards

Daniel Hill
-N|N-Labs

There has been no update from you for a while, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks

Hi,
Would you please share the one-by-one steps, so that we can simply follow them to quickly set up a developer kit and reproduce the issue?