Repeated system crash Ubuntu 24.04, Dual 5090 setup

Hello all,

I am reporting a critical system stability issue affecting a dual NVIDIA GeForce RTX 5090 setup. It appears to be related to the known GSP firmware bug affecting Blackwell (RTX 50 series) GPUs. Any help identifying the root cause would be appreciated.

SYSTEM INFORMATION

GPU: 2x NVIDIA GeForce RTX 5090
Driver: 580.126.09 (open kernel module)
OS: Ubuntu 24.04.3 LTS
Kernel: 6.17.0-22-generic
Motherboard: GIGABYTE TRX50 AI TOP (firmware F12g, 03/31/2026)
CPU: AMD Ryzen Threadripper 7970X (32-core)
RAM: 256GB

PROBLEM DESCRIPTION

The system experiences HARD FREEZES requiring a physical power cycle to recover.
This occurs RANDOMLY, often during IDLE or low GPU load conditions.

Symptoms:

  • Complete system lockup (no SSH, no keyboard, no display)
  • Logs stop abruptly mid-write
  • No kernel panic or crash dump generated
  • Requires physical power button to recover

When logs are preserved (rare), the following errors appear:

[    9.072467] kernel: NVRM: GPU at PCI:0000:41:00: GPU-8f4499d4-d4fb-126f-65b5-d6319979c633
[    9.072471] kernel: NVRM: Xid (PCI:0000:41:00): 79, GPU has fallen off the bus.
[    9.072479] kernel: NVRM: kgspRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.
[    9.072679] kernel: NVRM: _kgspLogRpcSanityCheckFailure: GPU1 sanity check failed 0xf waiting for RPC response from GSP.
               Expected function 4097 (GSP_INIT_DONE) sequence 0 (0x0 0x0).
[    9.122798] kernel: NVRM: osInitNvMapping: *** Cannot attach gpu
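
For anyone triaging similar logs, here is a minimal Python sketch that pulls the Xid code and PCI address out of lines shaped like the ones above. The names `XidEvent`, `parse_xid_line` and `find_xid_events` are hypothetical helpers, and the regex assumes the exact `[timestamp] kernel: NVRM: Xid (PCI:...): N, message` layout pasted here; other journal output formats would need a different pattern.

```python
import re
from typing import NamedTuple, Optional


class XidEvent(NamedTuple):
    """One NVRM Xid report extracted from a kernel log line (hypothetical helper type)."""
    timestamp: float   # seconds since boot, as printed in the log
    pci_address: str   # e.g. "0000:41:00"
    xid: int           # Xid code, e.g. 79
    message: str       # free-text description after the code


# Assumes the dmesg-style layout shown in this post.
XID_PATTERN = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+kernel:\s+NVRM:\s+Xid\s+"
    r"\(PCI:(?P<pci>[0-9a-fA-F:.]+)\):\s+(?P<xid>\d+),\s*(?P<msg>.*)"
)


def parse_xid_line(line: str) -> Optional[XidEvent]:
    """Return an XidEvent if the line is an Xid report, otherwise None."""
    m = XID_PATTERN.search(line)
    if m is None:
        return None
    return XidEvent(float(m["ts"]), m["pci"], int(m["xid"]), m["msg"].strip())


def find_xid_events(log_text: str) -> list[XidEvent]:
    """Scan a multi-line log and collect every Xid report found."""
    events = []
    for line in log_text.splitlines():
        event = parse_xid_line(line)
        if event is not None:
            events.append(event)
    return events
```

Feeding it a saved kernel log from a previous boot makes it easy to see which GPU (by PCI address) reported which Xid code before the freeze.
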

QUESTIONS

  1. Is this a known GSP firmware issue affecting dual-RTX-5090 configurations?
  2. Is there a driver update (595.x or newer) that addresses this GSP timeout?
  3. Can you confirm if disabling GSP (nvidia.NVreg_EnableGpuFirmware=0) is safe for RTX 5090?
  4. Are there any kernel parameters or workarounds to prevent GSP timeouts?
  5. Is a driver fix planned for the 580/595 series?
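
When testing kernel parameters such as the one in question 3, it helps to confirm what the kernel was actually booted with. The sketch below is a small Python helper (the function names are hypothetical) that lists the `nvidia.*` options found on the kernel command line. It assumes options are passed on the command line; anything set through `/etc/modprobe.d` would not show up here.

```python
from pathlib import Path


def nvidia_boot_params(cmdline: str) -> dict[str, str]:
    """Collect nvidia.<key>=<value> options from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        # Only options for the "nvidia" module; "nvidia-drm.*" etc. are skipped.
        if token.startswith("nvidia."):
            key, _, value = token[len("nvidia."):].partition("=")
            params[key] = value
    return params


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's command line (Linux only)."""
    return Path(path).read_text()


if __name__ == "__main__":
    for key, value in nvidia_boot_params(read_cmdline()).items():
        print(f"{key} = {value}")
```

Running it after a reboot shows whether a parameter like `NVreg_EnableGpuFirmware` made it onto the command line at all, which rules out a bootloader configuration mistake before drawing conclusions about the driver.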

This issue matches multiple confirmed bug reports:

  1. “RTX 5070 (10de:2f04) — Spontaneous GSP RPC timeout, GPU lost from bus on 580.126.18”

  2. “[Bug] Hard System Lockup / GPU Lost From Bus under Load - RTX 5070 (595.58.03) on Kernel 7.0.0 [Bug Report Attached]”

  3. GitHub Issue #1080 (NVIDIA/open-gpu-kernel-modules): “RTX 5090 (GB202): GSP heartbeat timeout -> Xid 109/8 under Vulkan load via Proton (595.58.03, 590.48.01)”

Best regards,
Shakhizat

Hello @aplattner, could you please help with this issue? I have also posted in Discord channels and sent an email to NVIDIA. It appears to be similar to: [580.105.08 Regression] Hibernate (S4) no longer powers off, fans stay on, resume still works

The GSP timeout is obviously caused by the GPU falling off the bus in the first place: how can it not time out if the GPU is no longer there? ;-)

Now regarding falling off the bus, in the words of an NV eng:

So check these 3 things first and definitely the most recent driver (595.71 currently).
However if you browse this forum a bit, you will find that falling off the bus is something that Blackwell just tends to do for many ppl…

Hi, thanks for your reply. This might be the same issue: RTX 5060 Ti (Blackwell) — GSP Firmware Crash causes GPU lockup (black screen, 100% fans, hard reset required) on nvidia-open 595.71.05 - Issues & Assistance - CachyOS Forum