Hello all,
I am reporting a critical system stability issue affecting a dual NVIDIA GeForce RTX 5090 setup. It appears to be related to the known GSP firmware bug affecting Blackwell (RTX 50 series) GPUs. Any help identifying the cause would be appreciated.
SYSTEM INFORMATION
GPU: 2x NVIDIA GeForce RTX 5090
Driver: 580.126.09 (open kernel module)
OS: Ubuntu 24.04.3 LTS
Kernel: 6.17.0-22-generic
Motherboard: GIGABYTE TRX50 AI TOP (firmware F12g, 03/31/2026)
CPU: AMD Ryzen Threadripper 7970X (32-core)
RAM: 256GB
PROBLEM DESCRIPTION
The system experiences HARD FREEZES requiring a physical power cycle to recover.
This occurs RANDOMLY, often during IDLE or low GPU load conditions.
Symptoms:
- Complete system lockup (no SSH, no keyboard, no display)
- Logs stop abruptly mid-write
- No kernel panic or crash dump generated
- Requires physical power button to recover
When logs are preserved (rare), the following errors appear:
[ 9.072467] kernel: NVRM: GPU at PCI:0000:41:00: GPU-8f4499d4-d4fb-126f-65b5-d6319979c633
[ 9.072471] kernel: NVRM: Xid (PCI:0000:41:00): 79, GPU has fallen off the bus.
[ 9.072479] kernel: NVRM: kgspRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.
[ 9.072679] kernel: NVRM: _kgspLogRpcSanityCheckFailure: GPU1 sanity check failed 0xf waiting for RPC response from GSP.
Expected function 4097 (GSP_INIT_DONE) sequence 0 (0x0 0x0).
[ 9.122798] kernel: NVRM: osInitNvMapping: *** Cannot attach gpu
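For anyone triaging similar logs, the Xid lines above can be pulled out of a saved kernel log with a short script. This is only a minimal sketch, not an NVIDIA tool: the helper name `parse_xid_events` and the regular expression are my own assumptions, based solely on the line format quoted above.

```python
import re
from typing import List, NamedTuple

# Pattern assumed from the log lines quoted in this post, e.g.
# "NVRM: Xid (PCI:0000:41:00): 79, GPU has fallen off the bus."
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<code>\d+), (?P<msg>.*)"
)


class XidEvent(NamedTuple):
    bus: str      # PCI address as printed by the driver
    code: int     # Xid number
    message: str  # remainder of the log line


def parse_xid_events(log_text: str) -> List[XidEvent]:
    """Return every Xid event found in a block of kernel log text.

    Hypothetical helper for illustration; lines that do not match are skipped.
    """
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                XidEvent(match["bus"], int(match["code"]), match["msg"].strip())
            )
    return events


if __name__ == "__main__":
    # Two of the lines from the log excerpt above.
    sample = (
        "[ 9.072467] kernel: NVRM: GPU at PCI:0000:41:00: "
        "GPU-8f4499d4-d4fb-126f-65b5-d6319979c633\n"
        "[ 9.072471] kernel: NVRM: Xid (PCI:0000:41:00): 79, "
        "GPU has fallen off the bus.\n"
    )
    for event in parse_xid_events(sample):
        print(f"Xid {event.code} on {event.bus}: {event.message}")
```

The input could be, for example, the output of `journalctl -k -b -1` (kernel messages from the previous boot) saved to a file, provided the journal survived the freeze.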
QUESTIONS
- Is this a known GSP firmware issue affecting dual RTX 5090 configurations?
- Is there a driver update (595.x or newer) that addresses this GSP timeout?
- Can you confirm if disabling GSP (nvidia.NVreg_EnableGpuFirmware=0) is safe for RTX 5090?
- Are there any kernel parameters or workarounds to prevent GSP timeouts?
- Is a driver fix planned for the 580/595 series?
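To make clear what is currently configured on my side, the following minimal sketch checks whether `nvidia.NVreg_EnableGpuFirmware` is present on the running kernel's command line. The helper name `find_module_option` is hypothetical, and the script only inspects `/proc/cmdline`; it does not look at options set through modprobe configuration files, so a "not set" result is not conclusive.

```python
from pathlib import Path
from typing import Optional


def find_module_option(cmdline: str, option: str) -> Optional[str]:
    """Return the value of `option` in a kernel command line string, or None.

    Hypothetical helper: tokens are split on whitespace and matched as key=value.
    """
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key == option:
            return value
    return None


if __name__ == "__main__":
    # /proc/cmdline holds the boot command line on Linux.
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        value = find_module_option(
            proc_cmdline.read_text(), "nvidia.NVreg_EnableGpuFirmware"
        )
        if value is None:
            print("nvidia.NVreg_EnableGpuFirmware is not set on the kernel command line")
        else:
            print(f"nvidia.NVreg_EnableGpuFirmware={value}")
    else:
        print("/proc/cmdline not available on this system")
```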
This issue appears to match several existing bug reports:
- "RTX 5070 (10de:2f04) — Spontaneous GSP RPC timeout, GPU lost from bus on 580.126.18"
- "[Bug] Hard System Lockup / GPU Lost From Bus under Load - RTX 5070 (595.58.03) on Kernel 7.0.0 [Bug Report Attached]"
- GitHub Issue #1080, NVIDIA/open-gpu-kernel-modules: "RTX 5090 (GB202): GSP heartbeat timeout -> Xid 109/8 under Vulkan load via Proton (595.58.03, 590.48.01)"
Best regards,
Shakhizat