[Bug] Xid 62 GSP Watchdog Timeout on RTX 5080 during high-context LLM inference

Hardware: NVIDIA GeForce RTX 5080 (16GB)
Driver Version: 595.71.05 (Open Kernel Module)
OS: Proxmox VE 9.1.11 (Kernel 7.0.2-3-pve / Debian 12 based)
Workload: LLM Inference (llama.cpp)

1. Problem Description

The GPU enters an unrecoverable state (Xid 62) during high-context LLM inference tasks. The failure is highly deterministic and specifically reproducible when using Multi-Token Prediction (MTP) model architectures (e.g., Qwen3.6-35B-MTP) at context windows $\ge$ 128k. Standard MoE and Dense models are perfectly stable, suggesting a logic bug in the GSP scheduler when handling the specific tensor scheduling required by MTP architectures on Blackwell.

2. Environment & Mitigations (Active during crash)

To rule out environmental or hardware defects, this system operates with strict constraints. The crash occurred despite these active mitigations:

  • Power Capping: Active power limit set to 288 W (80% of 360 W TDP) via nvidia-smi -pl 288 to prevent transient voltage spikes.
  • Thermal Management: GPU temperature at the exact time of the crash was a very cool 52°C. Active cooling was running at 92% fan speed.
  • Persistence Mode: nvidia-persistenced was actively running on the host.
  • Isolation: The workload was running with direct GPU passthrough; no concurrent GPU processes were active on the host or in other containers.

3. Error Logs (GSP-CrashCat)

The GSP firmware consistently fails in the exact same scheduler thread across multiple incidents.

  • GSP Build ID: 6332050fe14352839619385982f83700835ef570
  • GSP Task: partition:4#0, task:3 (RISC-V Scheduler)
  • Forensic Address: GSP task watchdog timeout @ pc:0x1b3e200

Kernel Trace:
NVRM: Xid (PCI:0000:01:00): 62, 323f0f30 0000c3d8 00000000 206dae68 206da022 206da190 206d85f6 206d8d86
NVRM: GPU0 _kgspRpcGspEventPmuHalted: Received signal from GSP that PMU has halted.
NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
NVRM: GPU0 kgspHealthCheck_TU102: ********** GSP-CrashCat Report ***********
NVRM: RISC-V CSR State: sstatus:0x0000000200000020 sscratch:0xffffffffa3013970 sie:0x0000000000000220 sip:0x0000000000000020
GSP task watchdog timeout @ pc:0x1b3e200, partition:4#0, task:3

4. Steps to Reproduce

  1. Load Qwen3.6-35B-A3B-MTP-GGUF (IQ2_M or Q2_K_XL).
  2. Set context size to 131,072.
  3. Set flash-attn = on and fit = on in llama.cpp.
  4. The GSP will reliably crash during the “fitting params to device memory” phase, even when VRAM usage is ~13.3 GB (leaving nearly 3 GB of physical headroom).

5. Technical Conclusion

The presence of strict power capping and stable, low thermals (52°C) at the time of failure strongly indicates that this is not a hardware overheating or power-delivery issue.
The deterministic nature of the crash (specific to MTP architectures and high context fitting) points to a firmware scheduler regression/deadlock in the 595.xx Blackwell GSP driver branch.

I have attached the nvidia-bug-report-2026-05-24.log.gz (768.3 KB) generated immediately after the Xid 62 event.

Just wanted to throw in, I got some Xid 62s on my GTX1650. I was running 595.71.05 then (after I realized I had accidentally pinned the driver, I do prefer to run ‘latest and greatest’) later running 610.43.02. In my case, it was under light load, youtube video playing popped out with light (i.e. non-gaming) desktop use. Xid 62, “Received signal from GSP that PMU has halted” then just messages about stuff timing out + that chip status is “NV_ERR_GPU_IN_FULLCHIP_RESET”.)
I dropped from 7.0.x kernel (I was on 7.0.10 before I reverted – Zabbly kernel which is “stock kernel built with stock Ubuntu kernel config”, i.e. it’s not all patched up) to the 6.17.0-35 distro kernel (this is on Ubuntu 24.04), no freezes. I also saw firefox crashes I think under memory pressure, which I assumed was a firefox bug but that quit happening once I downgraded the kernel too.

Worth a shot!