Xid 62 / Xid 120 / Xid 154 — GSP firmware crashes on RTX PRO 6000 Blackwell under sustained GPU load

Summary

NVIDIA RTX PRO 6000 Blackwell Workstation Edition (GB202) suffers recurring GSP firmware crashes under sustained compute load, always affecting GPU 0 (PCI:0000:01:00). The crashes manifest as different Xid errors depending on the workload but always result in Xid 154 (GPU Reset Required) and require a full PSU power cycle to recover — soft reboot is insufficient.

This occurs on driver 595.58.03 (open kernel modules), which was installed specifically to address earlier Xid 154 crashes on driver 580.126.09. The problem persists across multiple workload types including pure CUDA stress tests (gpu_burn), ruling out application-level causes.

Crash Events

Crash 1 — gpu_burn stress test (2026-03-29)

While running gpu_burn 120 (a 120-second CUDA stress test across both GPUs), GPU 0 crashed roughly two minutes in with a GSP page fault:

Mar 29 15:38:59 kernel: NVRM: Xid (PCI:0000:01:00): 120, GSP kernel exception: store access page fault (cause:0xf) @ pc:0xffffffff93003d42, partition:4#0
Mar 29 15:38:59 kernel: NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)

Crash 2 — vLLM inference (2026-03-27)

While serving a MoE model (FP8, TP=2) with vLLM 0.17.1, GPU 0 crashed during active inference with a PMU halt:

Mar 27 17:40:36 kernel: NVRM: Xid (PCI:0000:01:00): 62, 223b6a66 00002100 00000000 20662342 20662ba6 20664272 2065f7ca 2063f48c
Mar 27 17:40:36 kernel: NVRM: GPU0 _kgspRpcGspEventPmuHalted: Received signal from GSP that PMU has halted.
Mar 27 17:40:36 kernel: NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
Mar 27 17:40:36 kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=12542, name=VLLM::Worker, channel 0x00000002

After both crashes, the driver enters a loop of failed teardown attempts:

NVRM: krcWatchdog_IMPL: RC watchdog: GPU is probably locked!  Notify Timeout Seconds: 7
NVRM: GPU0 kgmmuClientShadowFaultBufferUnregister_IMPL: Unregistering non-replayable fault buffer failed (status=0x00000062)

The RC watchdog then fires every ~8 seconds indefinitely, nvidia-smi shows GPU 0 in ERR! status, and the GPU does not recover without a full PSU power cycle.
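To make triage like the above repeatable, a small Python sketch can pull Xid events out of saved kernel log text (e.g. from `journalctl -k`). This is an illustrative helper, not an NVIDIA tool; the regex assumes the `NVRM: Xid (PCI:...)` format shown in the logs above.

```python
import re

# Matches NVRM Xid lines as they appear in the kernel log, e.g.:
#   NVRM: Xid (PCI:0000:01:00): 120, GSP kernel exception: ...
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-f:.]+)\): (\d+)")

def scan_xids(log_text):
    """Return a list of (pci_bdf, xid_code) tuples found in kernel log text."""
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(log_text)]

def needs_reset(log_text):
    """Xid 154 is the driver flagging 'GPU Reset Required'."""
    return any(xid == 154 for _, xid in scan_xids(log_text))
```

Feeding it the two log lines from Crash 1 yields `[("PCI:0000:01:00", 120), ("PCI:0000:01:00", 154)]`, and `needs_reset` returns True.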

Key Observations

  • Always GPU 0 (PCI:0000:01:00) — GPU 1 (PCI:0000:03:00) has never crashed.
  • Different Xid root causes — Xid 120 (GSP page fault) from gpu_burn, Xid 62 (PMU halt) from vLLM. Both trigger Xid 154.
  • Reproducible with gpu_burn — This is a pure CUDA GEMM stress test, eliminating ML framework software as the cause.
  • Previously crashed on driver 580.126.09 with Xid 154 (no Xid 62/120 precursors logged on that driver).
  • Persists on driver 595.58.03, which was installed specifically to address Blackwell GSP issues.
  • GPU 0 has no display attached — display runs on the AMD iGPU. Xorg does not touch GPU 0.
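For readers cross-referencing the codes in these observations, the meanings below are paraphrased from NVIDIA's published Xid error documentation (exact wording varies by driver release; treat this as a summary, not authoritative text):

```python
# Paraphrased from NVIDIA's Xid error documentation; wording varies by driver.
XID_MEANINGS = {
    45:  "Preemptive channel removal (robust-channel cleanup after a prior fault)",
    62:  "Internal micro-controller halt (here: PMU halted, reported via GSP)",
    120: "GSP error (here: GSP kernel exception / page fault)",
    154: "GPU recovery action changed (GPU Reset Required)",
}

def describe(xid):
    """Best-effort description of an Xid code seen in this report."""
    return XID_MEANINGS.get(xid, "Unknown Xid; consult NVIDIA's Xid documentation")
```

Note that Xid 45 here is a consequence, not a cause: the vLLM worker's channel is torn down after the PMU halt has already occurred.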

System Information

  • GPU: 2x NVIDIA RTX PRO 6000 Blackwell Workstation Edition (GB202)
  • VBIOS: 98.02.52.00.02
  • Board Part Number: 900-5G144-2200-000
  • GPU 0 Serial: 1791625016090
  • GPU 0 UUID: GPU-df022e84-9140-5762-5007-49bd93f72c3e
  • GPU 0 PCIe: 0000:01:00.0
  • GPU 1 Serial: 1791625016387
  • GPU 1 UUID: GPU-dd776a30-8270-99ed-ad2d-edd0d934d3c8
  • GPU 1 PCIe: 0000:03:00.0
  • Driver: 595.58.03 (open kernel modules, .run installer)
  • CUDA: 13.2
  • OS: Ubuntu 24.04.4 LTS
  • Kernel: 6.17.0-14-generic (x86_64)
  • CPU: AMD Ryzen 9 9950X3D 16-Core
  • RAM: 128 GB
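The per-GPU fields above can be re-collected with `nvidia-smi --query-gpu=index,serial,uuid,pci.bus_id,vbios_version --format=csv,noheader`. A small sketch for turning that CSV output into records (the sample line in the test mirrors this system's GPU 0; field order is whatever was passed to --query-gpu):

```python
import csv
import io

# Field order must match the --query-gpu argument list.
FIELDS = ["index", "serial", "uuid", "pci.bus_id", "vbios_version"]

def parse_smi_csv(text):
    """Parse 'csv,noheader' output from nvidia-smi into one dict per GPU."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, row)) for row in reader if row]
```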

Steps to Reproduce

  1. Install driver 595.58.03 with open kernel modules on Blackwell RTX PRO 6000
  2. Run gpu_burn 120 (available at github.com/wilicc/gpu-burn, a multi-GPU CUDA stress test)
  3. GPU 0 crashes with Xid 120 → Xid 154 within ~2 minutes

Expected Behavior

GPU should sustain compute workloads without GSP firmware crashes.

Actual Behavior

GPU 0 GSP firmware crashes (page fault or PMU halt), triggering unrecoverable Xid 154. Full PSU power cycle required.

Questions for NVIDIA

  1. The crash is always on GPU 0 (01:00.0), never GPU 1 (03:00.0). Could this indicate a hardware defect on GPU 0, or is the PCI topology relevant?
  2. Are there known GSP firmware issues with VBIOS 98.02.52.00.02 on GB202?
  3. Is there a newer VBIOS or driver beta that addresses Xid 120/62 on Blackwell?

Attachments

  • nvidia-bug-report.log.gz (captured while GPU 0 was in the crashed state after the gpu_burn run). Unable to attach: the forum upload page never finishes loading the file.