Xid 62 / Xid 120 / Xid 154 — GSP firmware crashes on RTX PRO 6000 Blackwell under sustained GPU load

Summary

NVIDIA RTX PRO 6000 Blackwell Workstation Edition (GB202) suffers recurring GSP firmware crashes under sustained compute load, always affecting GPU 0 (PCI:0000:01:00). The crashes manifest as different Xid errors depending on the workload but always result in Xid 154 (GPU Reset Required) and require a full PSU power cycle to recover — soft reboot is insufficient.

This occurs on driver 595.58.03 (open kernel modules), which was installed specifically to address earlier Xid 154 crashes on driver 580.126.09. The problem persists across multiple workload types including pure CUDA stress tests (gpu_burn), ruling out application-level causes.

Crash Events

Crash 1 — gpu_burn stress test (2026-03-29)

While running gpu_burn 120 (a 120-second CUDA stress test across both GPUs), GPU 0 crashed roughly two minutes in with a GSP page fault:

Mar 29 15:38:59 kernel: NVRM: Xid (PCI:0000:01:00): 120, GSP kernel exception: store access page fault (cause:0xf) @ pc:0xffffffff93003d42, partition:4#0
Mar 29 15:38:59 kernel: NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)

Crash 2 — vLLM inference (2026-03-27)

While serving a MoE model (FP8, TP=2) with vLLM 0.17.1, GPU 0 crashed during active inference with a PMU halt:

Mar 27 17:40:36 kernel: NVRM: Xid (PCI:0000:01:00): 62, 223b6a66 00002100 00000000 20662342 20662ba6 20664272 2065f7ca 2063f48c
Mar 27 17:40:36 kernel: NVRM: GPU0 _kgspRpcGspEventPmuHalted: Received signal from GSP that PMU has halted.
Mar 27 17:40:36 kernel: NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
Mar 27 17:40:36 kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=12542, name=VLLM::Worker, channel 0x00000002

After both crashes, the driver enters a loop of failed teardown attempts:

NVRM: krcWatchdog_IMPL: RC watchdog: GPU is probably locked!  Notify Timeout Seconds: 7
NVRM: GPU0 kgmmuClientShadowFaultBufferUnregister_IMPL: Unregistering non-replayable fault buffer failed (status=0x00000062)

The RC watchdog then fires every ~8 seconds indefinitely, nvidia-smi shows GPU 0 in ERR! status, and the GPU does not recover without a full PSU power cycle.
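To make triage like the above repeatable, a small Python sketch can pull Xid events out of saved kernel log text (e.g. from `journalctl -k`). This is an illustrative helper, not an NVIDIA tool; the regex assumes the `NVRM: Xid (PCI:...)` format shown in the logs above.

```python
import re

# Matches NVRM Xid lines as they appear in the kernel log, e.g.:
#   NVRM: Xid (PCI:0000:01:00): 120, GSP kernel exception: ...
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-f:.]+)\): (\d+)")

def scan_xids(log_text):
    """Return a list of (pci_bdf, xid_code) tuples found in kernel log text."""
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(log_text)]

def needs_reset(log_text):
    """Xid 154 is the driver flagging 'GPU Reset Required'."""
    return any(xid == 154 for _, xid in scan_xids(log_text))
```

Feeding it the two log lines from Crash 1 yields `[("PCI:0000:01:00", 120), ("PCI:0000:01:00", 154)]`, and `needs_reset` returns True.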

Key Observations

  • Always GPU 0 (PCI:0000:01:00) — GPU 1 (PCI:0000:03:00) has never crashed.
  • Different Xid root causes — Xid 120 (GSP page fault) from gpu_burn, Xid 62 (PMU halt) from vLLM. Both trigger Xid 154.
  • Reproducible with gpu_burn — This is a pure CUDA GEMM stress test, eliminating ML framework software as the cause.
  • Previously crashed on driver 580.126.09 with Xid 154 (no Xid 62/120 precursors logged on that driver).
  • Persists on driver 595.58.03, which was installed specifically to address Blackwell GSP issues.
  • GPU 0 has no display attached — display runs on the AMD iGPU. Xorg does not touch GPU 0.
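For readers cross-referencing the codes in these observations, the meanings below are paraphrased from NVIDIA's published Xid error documentation (exact wording varies by driver release; treat this as a summary, not authoritative text):

```python
# Paraphrased from NVIDIA's Xid error documentation; wording varies by driver.
XID_MEANINGS = {
    45:  "Preemptive channel removal (robust-channel cleanup after a prior fault)",
    62:  "Internal micro-controller halt (here: PMU halted, reported via GSP)",
    120: "GSP error (here: GSP kernel exception / page fault)",
    154: "GPU recovery action changed (GPU Reset Required)",
}

def describe(xid):
    """Best-effort description of an Xid code seen in this report."""
    return XID_MEANINGS.get(xid, "Unknown Xid; consult NVIDIA's Xid documentation")
```

Note that Xid 45 here is a consequence, not a cause: the vLLM worker's channel is torn down after the PMU halt has already occurred.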

System Information

  • GPU: 2x NVIDIA RTX PRO 6000 Blackwell Workstation Edition (GB202)
  • VBIOS: 98.02.52.00.02
  • Board Part Number: 900-5G144-2200-000
  • GPU 0 Serial: 1791625016090
  • GPU 0 UUID: GPU-df022e84-9140-5762-5007-49bd93f72c3e
  • GPU 0 PCIe: 0000:01:00.0
  • GPU 1 Serial: 1791625016387
  • GPU 1 UUID: GPU-dd776a30-8270-99ed-ad2d-edd0d934d3c8
  • GPU 1 PCIe: 0000:03:00.0
  • Driver: 595.58.03 (open kernel modules, .run installer)
  • CUDA: 13.2
  • OS: Ubuntu 24.04.4 LTS
  • Kernel: 6.17.0-14-generic (x86_64)
  • CPU: AMD Ryzen 9 9950X3D 16-Core
  • RAM: 128 GB
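The per-GPU fields above can be re-collected with `nvidia-smi --query-gpu=index,serial,uuid,pci.bus_id,vbios_version --format=csv,noheader`. A small sketch for turning that CSV output into records (the sample line in the test mirrors this system's GPU 0; field order is whatever was passed to --query-gpu):

```python
import csv
import io

# Field order must match the --query-gpu argument list.
FIELDS = ["index", "serial", "uuid", "pci.bus_id", "vbios_version"]

def parse_smi_csv(text):
    """Parse 'csv,noheader' output from nvidia-smi into one dict per GPU."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, row)) for row in reader if row]
```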

Steps to Reproduce

  1. Install driver 595.58.03 with open kernel modules on Blackwell RTX PRO 6000
  2. Run gpu_burn 120 (available at github.com/wilicc/gpu-burn, a multi-GPU CUDA stress test)
  3. GPU 0 crashes with Xid 120 → Xid 154 within ~2 minutes

Expected Behavior

GPU should sustain compute workloads without GSP firmware crashes.

Actual Behavior

GPU 0 GSP firmware crashes (page fault or PMU halt), triggering unrecoverable Xid 154. Full PSU power cycle required.

Questions for NVIDIA

  1. The crash is always on GPU 0 (01:00.0), never GPU 1 (03:00.0). Could this indicate a hardware defect on GPU 0, or is the PCI topology relevant?
  2. Are there known GSP firmware issues with VBIOS 98.02.52.00.02 on GB202?
  3. Is there a newer VBIOS or driver beta that addresses Xid 120/62 on Blackwell?

Attachments

  • nvidia-bug-report.log.gz (captured while GPU 0 was in the crashed state after the gpu_burn run). Unable to attach: the forum upload page never finishes loading the file.