I'm seeing an identical issue with vLLM on long inference jobs. I haven't tried downgrading the AGESA/BIOS, because ASRock boards are notorious for killing 9800X3Ds.
The issue does not seem to be power- or thermal-related; I have plenty of headroom on both.
| Component | Detail |
|---|---|
| GPU | NVIDIA GeForce RTX 5090 (PCI 0000:01:00.0, VBIOS 98.02.2E.00.E4) |
| Motherboard | ASRock X870 Steel Legend WiFi |
| CPU | AMD Ryzen 7 9800X3D (8-core, AM5) |
| Chipset | AMD X870 |
| OS | Arch Linux |
| Kernel | 7.0.9-arch1-1 |
| NVIDIA driver | 595.71.05 (NVIDIA Open Kernel Module, x86_64) |
| Desktop / display server | KDE Plasma (Wayland) |
| Workload | vLLM 0.21.0 serving sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP (Qwen3.6 27B NVFP4 + MTP draft speculative decoding), continuous single-stream inference at 195k input + 2048 output tokens, fp8 KV cache, CUDA graphs enabled, gpu-memory-utilization=0.92. |

Kernel logs at crash:
NVRM: GPU at PCI:0000:01:00: GPU-beafa3e5-0d05-0524-6c63-e3698b624fbd
NVRM: Xid (PCI:0000:01:00): 79, pid=12625, name=btop, GPU has fallen off the bus.
NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
NVRM: GPU0 krcRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.
NVRM: GPU0 _kgspIsHeartbeatTimedOut: Heartbeat timed out, currentTimeMs 0 heartbeat 4294967295 heartbeatWithOffsetMs 890989791 diff 3403977505 timeout 5200
NVRM: GPU0 _kgspRpcRecvPoll: GSP RM heartbeat timed out
NVRM: GPU0 _kgspRpcRecvPoll: LibOS heartbeat timed out
NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x2 (Node Reboot Required)
NVRM: GPU0 _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78
NVRM: GPU0 nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST]
NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET)
NVRM: GPU0 kgmmuFaultBufferReplayableDestroy_IMPL: Unregistering Replayable Fault buffer failed (status=0x0000000f), proceeding...
NVRM: GPU0 uvmTerminateAccessCntrBuffer_IMPL: Unloading UVM Access counters failed (status=0x0000000f), proceeding...
NVRM: _issueRpcLarge: rpcSendMessage failed with status 0x0000000f for fn 76!
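
For anyone comparing crashes across machines, here is a rough sketch of how the Xid events could be pulled out of a kernel log so the sequence (79 first, then 154) is easy to see. The regex is only inferred from the line format in the excerpt above, not from any NVIDIA specification, and `extract_xids` and `SAMPLE` are names I made up for illustration.

```python
import re

# Pattern inferred from the log excerpt above, e.g.
# "NVRM: Xid (PCI:0000:01:00): 79, pid=12625, name=btop, GPU has fallen off the bus."
# This is an assumption about the format, not an official grammar.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),\s*(.*)")


def extract_xids(log_text):
    """Return a list of (pci_address, xid_code, message) tuples, in log order."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            pci, code, message = match.groups()
            events.append((pci, int(code), message.strip()))
    return events


# Three lines copied from the excerpt above; only two of them are Xid lines.
SAMPLE = """\
NVRM: Xid (PCI:0000:01:00): 79, pid=12625, name=btop, GPU has fallen off the bus.
NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x2 (Node Reboot Required)
"""

if __name__ == "__main__":
    for pci, code, message in extract_xids(SAMPLE):
        print(f"{pci} Xid {code}: {message}")
```

Something like the output of `journalctl -k -b -1` (kernel messages from the previous boot) could be passed in as `log_text` instead of `SAMPLE`, assuming the journal from the crashed boot was persisted.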