I’m experiencing persistent Xid 31 (MMU Fault) and Xid 13 (Illegal Instruction Encoding) errors on one of two identical RTX PRO 6000 Blackwell Workstation Edition
GPUs. The fault follows the card across PCIe slots. The other identical GPU in the same system has zero errors under the same workloads. NVIDIA Customer Care
directed me here.
System:
- Motherboard: ASUS Pro WS WRX90E-SAGE SE
- CPU: AMD Turin
- BIOS: 1317
- OS: Ubuntu 25.10 (kernel 6.17.0-19-generic)
- Driver: 590.48.01 (also tested 580.126.09)
- CUDA: 13.1
- GPUs: 2x NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96GB each)
Troubleshooting performed:
1. Tested across two driver versions (580.126.09, 590.48.01) — same errors
2. Moved the card from one PCIe slot to another — fault follows the card
3. The other identical GPU runs the same workloads (vLLM inference, ~93GB VRAM, 98% utilization) with zero Xid errors
4. Ran NVIDIA’s own nvvs diagnostic — it also triggers Xid 31 on this card
Error patterns (47 total Xid events in journal):
- Xid 31 MMU Fault across multiple engines (CE4, GRAPHICS) and multiple GPCs
- Xid 13 Illegal Instruction Encoding + Multiple Warp Errors
- Fault types include FAULT_PDE, FAULT_PTE, and FAULT_INFO_TYPE_UNSUPPORTED_KIND
- Triggered by multiple processes: vLLM, python3, and nvvs
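For anyone who wants to reproduce the tally, here is a minimal sketch of how the Xid events can be counted per card and per Xid code. It assumes the kernel log lines carry the same `NVRM: Xid (PCI:<address>): <code>,` prefix as the excerpts in this post; the function name `tally_xids` is my own and not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Assumption: every Xid line has the "NVRM: Xid (PCI:<address>): <code>," prefix
# seen in the dmesg excerpts of this post.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def tally_xids(lines):
    """Count Xid events keyed by (PCI address, Xid code)."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Small demo on two lines copied from the excerpts, plus one unrelated line.
sample = [
    "NVRM: Xid (PCI:0000:f1:00): 13, Graphics SM Global Exception on "
    "(GPC 0, TPC 0, SM 0): Multiple Warp Errors",
    "NVRM: Xid (PCI:0000:f1:00): 31, pid=10994, name=nvvs, MMU Fault: "
    "ENGINE GRAPHICS GPC11 GPCCLIENT_T1_3",
    "usb 1-1: new high-speed USB device",
]
for (address, code), n in sorted(tally_xids(sample).items()):
    print(f"{address}  Xid {code}: {n}")
```

To run it against a real log, the output of `journalctl -k` can be read line by line and passed to `tally_xids` in place of `sample`. Keying on the PCI address as well as the code keeps the events from the two slots separate, which matters here because the same card shows up as both f1:00 and e1:00.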
dmesg excerpts (card at original slot):
NVRM: Xid (PCI:0000:f1:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 0): Illegal Instruction Encoding
NVRM: Xid (PCI:0000:f1:00): 13, Graphics SM Global Exception on (GPC 0, TPC 0, SM 0): Multiple Warp Errors
NVRM: Xid (PCI:0000:f1:00): 31, pid=430328, name=python3, MMU Fault: ENGINE CE4 HUBCLIENT_CE1 faulted @ 0x792c_68e00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
NVRM: Xid (PCI:0000:f1:00): 31, pid=15953, name=VLLM::EngineCor, MMU Fault: ENGINE CE4 HUBCLIENT_CE0 faulted @ 0x7c82_2c604000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
dmesg excerpts (same card moved to second slot):
NVRM: Xid (PCI:0000:e1:00): 31, pid=7175, name=VLLM::Worker, MMU Fault: ENGINE GRAPHICS GPC9 GPCCLIENT_T1_10 faulted @ 0x3fba_ff81f000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:e1:00): 31, pid=4319, name=VLLM::Worker, MMU Fault: ENGINE GRAPHICS GPC2 GPCCLIENT_T1_11 faulted @ 0x34cb_497c8000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
nvvs also triggers faults on this card:
NVRM: Xid (PCI:0000:f1:00): 31, pid=10994, name=nvvs, MMU Fault: ENGINE GRAPHICS GPC11 GPCCLIENT_T1_3
NVRM: Xid (PCI:0000:f1:00): 31, pid=11490, name=nvvs, MMU Fault: ENGINE GRAPHICS GPC11 GPCCLIENT_T1_1
NVRM: Xid (PCI:0000:f1:00): 31, pid=12164, name=nvvs, MMU Fault: ENGINE GRAPHICS GPC8 GPCCLIENT_T1_1
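To show how the MMU faults spread across engines and GPCs, the Xid 31 lines can be broken into their fields. This is a rough sketch that assumes the `MMU Fault: ENGINE <engine> [GPCn] <client>` layout seen in the lines above; the regex and the names `parse_mmu_fault` and `faults_by_unit` are my own, not an NVIDIA-documented format.

```python
import re
from collections import Counter

# Assumption: Xid 31 lines follow the layout seen in this post, with the
# GPC token present only for GRAPHICS engine faults.
MMU_RE = re.compile(
    r"Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): 31,.*?"
    r"MMU Fault: ENGINE (?P<engine>\S+)"
    r"(?: (?P<gpc>GPC\d+))?"
    r" (?P<client>\S+)"
)


def parse_mmu_fault(line):
    """Return bus, engine, gpc and client for an Xid 31 line, else None."""
    match = MMU_RE.search(line)
    return match.groupdict() if match else None


def faults_by_unit(lines):
    """Count MMU faults keyed by (engine, gpc); gpc is None for CE faults."""
    counts = Counter()
    for line in lines:
        fault = parse_mmu_fault(line)
        if fault:
            counts[(fault["engine"], fault["gpc"])] += 1
    return counts


# Demo on the three nvvs lines quoted above.
nvvs_lines = [
    "NVRM: Xid (PCI:0000:f1:00): 31, pid=10994, name=nvvs, MMU Fault: ENGINE GRAPHICS GPC11 GPCCLIENT_T1_3",
    "NVRM: Xid (PCI:0000:f1:00): 31, pid=11490, name=nvvs, MMU Fault: ENGINE GRAPHICS GPC11 GPCCLIENT_T1_1",
    "NVRM: Xid (PCI:0000:f1:00): 31, pid=12164, name=nvvs, MMU Fault: ENGINE GRAPHICS GPC8 GPCCLIENT_T1_1",
]
for (engine, gpc), n in faults_by_unit(nvvs_lines).items():
    print(engine, gpc, n)
```

Grouping by engine and GPC rather than by process makes it easier to see whether the faults cluster on one unit or are scattered across the chip.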
nvidia-smi (healthy GPU running fine, faulty GPU idle):
+----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+----------------------------------------+-----------------------+---------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
|=========================================+========================+======================|
| 0 NVIDIA RTX PRO 6000 Blac… Off | 00000000:C1:00.0 Off | Off |
| 59% 83C P1 300W / 300W | 92974MiB / 97887MiB | 98% Default |
+----------------------------------------+-----------------------+---------------------+
| 1 NVIDIA RTX PRO 6000 Blac… Off | 00000000:E1:00.0 Off | Off |
| 30% 31C P8 9W / 300W | 2MiB / 97887MiB | 0% Default |
+----------------------------------------+-----------------------+---------------------+