Bug Report & Fix: RTX 5090 — Xid 79 GSP Firmware Crash Under Sustained CUDA Load

Seeing an identical issue with vLLM and long inference jobs. Haven’t tried downgrading AGESA/BIOS because ASRock boards are famous for blowing up 9800X3Ds.

The issue does not seem to be power- or thermal-related; I have plenty of headroom on both.

Component       Detail
GPU             NVIDIA GeForce RTX 5090 (PCI 0000:01:00.0, VBIOS 98.02.2E.00.E4)
Motherboard     ASRock X870 Steel Legend WiFi
CPU             AMD Ryzen 7 9800X3D (8-core, AM5)
Chipset         AMD X870
OS              Arch Linux
Kernel          7.0.9-arch1-1
NVIDIA driver   595.71.05 (NVIDIA Open Kernel Module, x86_64)
Display server  KDE (Wayland)
Workload        vLLM 0.21.0 serving sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP (Qwen3.6 27B NVFP4 + MTP draft speculative decoding), continuous single-stream inference at 195k input + 2048 output tokens, fp8 KV cache, CUDA graphs enabled, gpu-memory-utilization=0.92.

Kernel Logs at Crash

NVRM: GPU at PCI:0000:01:00: GPU-beafa3e5-0d05-0524-6c63-e3698b624fbd
NVRM: Xid (PCI:0000:01:00): 79, pid=12625, name=btop, GPU has fallen off the bus.
NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
NVRM: GPU0 krcRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.
NVRM: GPU0 _kgspIsHeartbeatTimedOut: Heartbeat timed out, currentTimeMs 0 heartbeat 4294967295
                                     heartbeatWithOffsetMs 890989791 diff 3403977505 timeout 5200
NVRM: GPU0 _kgspRpcRecvPoll: GSP RM heartbeat timed out
NVRM: GPU0 _kgspRpcRecvPoll: LibOS heartbeat timed out
NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x2 (Node Reboot Required)
NVRM: GPU0 _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78
NVRM: GPU0 nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST]
NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET)
NVRM: GPU0 kgmmuFaultBufferReplayableDestroy_IMPL: Unregistering Replayable Fault buffer failed (status=0x0000000f), proceeding...
NVRM: GPU0 uvmTerminateAccessCntrBuffer_IMPL: Unloading UVM Access counters failed (status=0x0000000f), proceeding...
NVRM: _issueRpcLarge: rpcSendMessage failed with status 0x0000000f for fn 76!
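
For anyone comparing crashes across machines, here is a rough sketch of how the Xid lines could be pulled out of a kernel log dump so the codes and PCI addresses are easy to diff. The names `XID_RE`, `XidEvent`, `parse_xid_events`, and `SAMPLE` are made up for this illustration, and the regex is only an assumption based on the line format in the log above, not an official NVIDIA format specification.

```python
import re
from typing import NamedTuple

# Assumed line shape, taken from the log above:
#   NVRM: Xid (PCI:0000:01:00): 79, pid=12625, name=btop, GPU has fallen off the bus.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<code>\d+), (?P<detail>.*)"
)


class XidEvent(NamedTuple):
    pci: str     # PCI address as printed by the driver
    code: int    # Xid number, e.g. 79
    detail: str  # remainder of the line after the Xid number


def parse_xid_events(log_text: str) -> list[XidEvent]:
    """Return every Xid event found in log_text, in order of appearance."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                XidEvent(match["pci"], int(match["code"]), match["detail"].strip())
            )
    return events


# Three lines copied from the crash log above; only two of them are Xid lines.
SAMPLE = """\
NVRM: Xid (PCI:0000:01:00): 79, pid=12625, name=btop, GPU has fallen off the bus.
NVRM: GPU0 _kgspRpcRecvPoll: GSP RM heartbeat timed out
NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x2 (Node Reboot Required)
"""

if __name__ == "__main__":
    for event in parse_xid_events(SAMPLE):
        print(event.code, event.pci, event.detail)
```

To run it against a real log, one option is to save the output of `journalctl -k` to a file and pass the file contents to `parse_xid_events`.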