RTX PRO 6000 Blackwell — Persistent Xid 31 (MMU Fault) and Xid 13 errors, fault follows card across PCIe slots

I’m experiencing persistent Xid 31 (MMU Fault) and Xid 13 (Illegal Instruction Encoding) errors on one of two identical RTX PRO 6000 Blackwell Workstation Edition GPUs. The fault follows the card across PCIe slots. The other identical GPU in the same system has zero errors under the same workloads. NVIDIA Customer Care directed me here.

System:

  • Motherboard: ASUS Pro WS WRX90E-SAGE SE

  • CPU: AMD Turin

  • BIOS: 1317

  • OS: Ubuntu 25.10 (kernel 6.17.0-19-generic)

  • Driver: 590.48.01 (also tested 580.126.09)

  • CUDA: 13.1

  • GPUs: 2x NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96GB each)

Troubleshooting performed:

1. Tested across two driver versions (580.126.09, 590.48.01) — same errors

2. Moved the card from one PCIe slot to another — fault follows the card

3. The other identical GPU runs the same workloads (vLLM inference, ~93GB VRAM, 98% utilization) with zero Xid errors

4. Ran NVIDIA’s own nvvs diagnostic — it also triggers Xid 31 on this card
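
For anyone repeating the slot-swap test in step 2, it can help to tie each bus ID to a card serial number before and after the move, so that "the fault follows the card" rests on the card's identity rather than on memory of which card went where. The following is only a minimal sketch: it assumes `nvidia-smi` is on the PATH, and the serial numbers and UUIDs in the sample strings are made-up placeholders, not values from this system. Only the bus IDs are taken from the logs in this post.

```python
import csv
import io
import subprocess

# index, pci.bus_id, serial and uuid are standard --query-gpu fields.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,pci.bus_id,serial,uuid",
    "--format=csv,noheader",
]


def query_inventory_text():
    """Run nvidia-smi and return its raw CSV output (needs a working driver)."""
    return subprocess.run(
        QUERY_CMD, capture_output=True, text=True, check=True
    ).stdout


def parse_gpu_inventory(csv_text):
    """Map each GPU serial number to its current index, bus ID and UUID."""
    inventory = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        index, bus_id, serial, uuid = (field.strip() for field in row)
        inventory[serial] = {"index": int(index), "bus_id": bus_id, "uuid": uuid}
    return inventory


def moved_cards(before, after):
    """Return {serial: (old_bus_id, new_bus_id)} for cards that changed slot."""
    return {
        serial: (before[serial]["bus_id"], after[serial]["bus_id"])
        for serial in before
        if serial in after and before[serial]["bus_id"] != after[serial]["bus_id"]
    }


# Placeholder serials/UUIDs (hypothetical); bus IDs match the logs in this post.
BEFORE = (
    "0, 00000000:C1:00.0, SERIAL_A, GPU-placeholder-a\n"
    "1, 00000000:F1:00.0, SERIAL_B, GPU-placeholder-b\n"
)
AFTER = (
    "0, 00000000:C1:00.0, SERIAL_A, GPU-placeholder-a\n"
    "1, 00000000:E1:00.0, SERIAL_B, GPU-placeholder-b\n"
)

moved = moved_cards(parse_gpu_inventory(BEFORE), parse_gpu_inventory(AFTER))
print(moved)
```

On a live system, the sample strings would be replaced by the output of `query_inventory_text()` captured once before and once after the move. The serial reported as moved can then be checked against the bus ID shown in the Xid lines.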

Error patterns (47 total Xid events in journal):

  • Xid 31 MMU Fault across multiple engines (CE4, GRAPHICS) and multiple GPCs

  • Xid 13 Illegal Instruction Encoding + Multiple Warp Errors

  • Fault types include FAULT_PDE, FAULT_PTE, and FAULT_INFO_TYPE_UNSUPPORTED_KIND

  • Triggered by multiple processes: vLLM, python3, and nvvs

dmesg excerpts (card at original slot):

NVRM: Xid (PCI:0000:f1:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 0): Illegal Instruction Encoding

NVRM: Xid (PCI:0000:f1:00): 13, Graphics SM Global Exception on (GPC 0, TPC 0, SM 0): Multiple Warp Errors

NVRM: Xid (PCI:0000:f1:00): 31, pid=430328, name=python3, MMU Fault: ENGINE CE4 HUBCLIENT_CE1 faulted @ 0x792c_68e00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE

NVRM: Xid (PCI:0000:f1:00): 31, pid=15953, name=VLLM::EngineCor, MMU Fault: ENGINE CE4 HUBCLIENT_CE0 faulted @ 0x7c82_2c604000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE

dmesg excerpts (same card moved to second slot):

NVRM: Xid (PCI:0000:e1:00): 31, pid=7175, name=VLLM::Worker, MMU Fault: ENGINE GRAPHICS GPC9 GPCCLIENT_T1_10 faulted @ 0x3fba_ff81f000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

NVRM: Xid (PCI:0000:e1:00): 31, pid=4319, name=VLLM::Worker, MMU Fault: ENGINE GRAPHICS GPC2 GPCCLIENT_T1_11 faulted @ 0x34cb_497c8000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

nvvs also triggers faults on this card:

NVRM: Xid (PCI:0000:f1:00): 31, pid=10994, name=nvvs, MMU Fault: ENGINE GRAPHICS GPC11 GPCCLIENT_T1_3

NVRM: Xid (PCI:0000:f1:00): 31, pid=11490, name=nvvs, MMU Fault: ENGINE GRAPHICS GPC11 GPCCLIENT_T1_1

NVRM: Xid (PCI:0000:f1:00): 31, pid=12164, name=nvvs, MMU Fault: ENGINE GRAPHICS GPC8 GPCCLIENT_T1_1
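
To reproduce a tally like the one under "Error patterns", the kernel log lines can be grouped by PCI address, Xid code, and process name. The sketch below is not an NVIDIA tool; the regular expression is an assumption based only on the line format visible in the excerpts above, and the sample input is a handful of those same lines.

```python
import re
from collections import Counter

# Pattern inferred from the NVRM lines quoted in this post (an assumption,
# not an official format specification). pid/name are absent on Xid 13 lines.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
    r"(?: pid=(?P<pid>\d+), name=(?P<name>[^,]+),)?"
    r" (?P<detail>.*)"
)


def parse_xid_lines(lines):
    """Return one dict per line that looks like an NVRM Xid report."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append(match.groupdict())
    return events


def summarize(events):
    """Count events per (PCI address, Xid code, process name)."""
    return Counter(
        (e["pci"], int(e["xid"]), e["name"] or "-") for e in events
    )


# Sample input copied from the dmesg excerpts in this post.
SAMPLE = [
    "NVRM: Xid (PCI:0000:f1:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 0): Illegal Instruction Encoding",
    "NVRM: Xid (PCI:0000:f1:00): 13, Graphics SM Global Exception on (GPC 0, TPC 0, SM 0): Multiple Warp Errors",
    "NVRM: Xid (PCI:0000:f1:00): 31, pid=430328, name=python3, MMU Fault: ENGINE CE4 HUBCLIENT_CE1 faulted @ 0x792c_68e00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE",
    "NVRM: Xid (PCI:0000:e1:00): 31, pid=7175, name=VLLM::Worker, MMU Fault: ENGINE GRAPHICS GPC9 GPCCLIENT_T1_10 faulted @ 0x3fba_ff81f000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ",
    "NVRM: Xid (PCI:0000:f1:00): 31, pid=10994, name=nvvs, MMU Fault: ENGINE GRAPHICS GPC11 GPCCLIENT_T1_3",
]

events = parse_xid_lines(SAMPLE)
summary = summarize(events)
for (pci, xid, name), count in sorted(summary.items()):
    print(f"{pci}  Xid {xid:<3} {name:<16} {count}")
```

For a full log, `SAMPLE` could be replaced with the lines of a saved kernel log, for example the output of `journalctl -k` or `dmesg` written to a file. Lines that do not match the pattern are skipped silently.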

nvidia-smi (healthy GPU running fine, faulty GPU idle):

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac…      Off |   00000000:C1:00.0 Off |                  Off |
| 59%   83C    P1            300W /  300W |   92974MiB /  97887MiB |     98%      Default |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX PRO 6000 Blac…      Off |   00000000:E1:00.0 Off |                  Off |
| 30%   31C    P8              9W /  300W |       2MiB /  97887MiB |      0%      Default |
+-----------------------------------------+------------------------+----------------------+

I’m experiencing an identical issue on the same motherboard, although I’m on Ubuntu 24.04 and have just a single RTX PRO 6000 card. Did you manage to resolve it? Was it a hardware fault?