RTX 5090 FE - Hard System Crash Under CUDA/LoRA Load - Xid 119 GSP Timeout (Linux Mint)

Hello everyone,

I’m hoping to get some guidance from NVIDIA engineers and the community on a recurring issue I’m experiencing with my RTX 5090 Founders Edition. NVIDIA Customer Care has kindly directed me here for technical validation, so I’ve gathered as much diagnostic information as possible.

Thank you in advance for taking the time to look at this!


System Configuration

| Component | Specification |
|-----------|---------------|
| GPU | NVIDIA GeForce RTX 5090 Founders Edition |
| CPU | AMD Ryzen 9 9950X3D (16-core) |
| Motherboard | ASUS ROG STRIX X870-A GAMING WIFI |
| RAM | 128GB DDR5 |
| PSU | be quiet! Dark Power 14 1200W (ATX 3.1, 80+ Titanium) |
| OS | Linux Mint 22.2 |
| Kernel | 6.14.0-37-generic |
| Driver | 580.95.05 |
| CUDA | 13.0 |
| VBIOS | 98.02.2E.00.03 |


The Error: Xid 119 - GSP Timeout


NVRM: _kgspLogXid119: ********************************* GSP Timeout **********************************

NVRM: Xid (PCI:0000:01:00): 119, Timeout after 6s of waiting for RPC response from GPU0 GSP!

Expected function 4097 (GSP_INIT_DONE) sequence 0 (0x0 0x0).

NVRM: _kgspLogXid119: ********************************************************************************
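
For anyone wondering where an Xid message like the one above comes from: it is written to the kernel log by the NVIDIA driver, so it can be pulled out of the previous boot's journal. Below is a minimal Python sketch of that, assuming the Xid line has the same shape as the one quoted above. The function names are my own, and `journalctl` may need sudo depending on the system.

```python
import re
import subprocess

# Pattern for the driver's Xid line, modelled on the line quoted in this post.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9A-Fa-f:.]+)\): (?P<code>\d+), (?P<msg>.*)"
)


def find_xid_events(log_text):
    """Return (pci_address, xid_code, message) for each Xid line in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("pci"), int(match.group("code")), match.group("msg").strip())
            )
    return events


def previous_boot_kernel_log():
    """Kernel messages from the boot before the current one (journalctl -k -b -1)."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
    )
    return result.stdout


# Demo on the line from this post; on a real system, pass
# previous_boot_kernel_log() to find_xid_events() instead.
sample = (
    "NVRM: Xid (PCI:0000:01:00): 119, Timeout after 6s of waiting "
    "for RPC response from GPU0 GSP!"
)
for pci, code, msg in find_xid_events(sample):
    print(f"Xid {code} on {pci}: {msg}")
```

If the machine loses power instantly, the last messages may never reach the disk, so an empty result does not prove that no Xid was raised.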


Problem Description

The system experiences an instant, hard power-off when the GPU is placed under sustained CUDA compute load. There are no kernel panics, driver crashes, or software errors preceding the event. The machine simply clicks off and reboots. This behaviour is 100% reproducible under specific workloads.

Workloads that trigger the crash:

  • LoRA / AI Training (using kohya_ss with CUDA)

  • LLM Inference (loading large models into VRAM via Ollama, e.g., Qwen 32B)

The system is perfectly stable at idle and under normal desktop use. It only crashes under high CUDA compute load.


Crash History (from system logs)

The output of the last reboot command shows sessions ending with a “still running” flag (no clean shutdown), indicating hard power cuts:

| Date | Boot Time | Crash Time | Duration | Context |
|------|-----------|------------|----------|---------|
| Dec 16 | 05:32 | 13:06 | ~7.5 hrs | Crash during LoRA training |
| Dec 16 | 03:55 | 05:32 | ~1.5 hrs | Crash during Ollama/LLM workload |
| Dec 15 | 20:32 | 03:55 | ~7.5 hrs | Overnight session hard crash |
| Dec 15 | 12:03 | 20:32 | ~8.5 hrs | Day session hard crash |
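
The durations in the table can be checked from the boot and crash times. Here is a small Python sketch of that arithmetic, assuming each session is shorter than 24 hours so that a crash time earlier than the boot time means the session ran past midnight. The helper name `session_minutes` is just for illustration.

```python
from datetime import datetime, timedelta


def session_minutes(boot, crash):
    """Minutes between two HH:MM times, assuming the crash is under 24h after boot."""
    fmt = "%H:%M"
    start = datetime.strptime(boot, fmt)
    end = datetime.strptime(crash, fmt)
    if end <= start:
        # The session ran past midnight, so the crash is on the next day.
        end += timedelta(days=1)
    return int((end - start).total_seconds() // 60)


# Boot and crash times copied from the table above.
sessions = [("05:32", "13:06"), ("03:55", "05:32"), ("20:32", "03:55"), ("12:03", "20:32")]
for boot, crash in sessions:
    minutes = session_minutes(boot, crash)
    print(f"{boot} -> {crash}: {minutes // 60}h {minutes % 60:02d}m")
# 05:32 -> 13:06: 7h 34m
# 03:55 -> 05:32: 1h 37m
# 20:32 -> 03:55: 7h 23m
# 12:03 -> 20:32: 8h 29m
```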


Troubleshooting Steps Already Performed

I’ve done my best to rule out software, driver, or configuration issues on my end:

  1. Power Limiting:
  • Reduced power limit to 450W via nvidia-smi -pl 450 - still crashes.

  • Reduced power limit to 400W (the GPU’s minimum) - still crashes.

  2. PSU Configuration:
  • PSU is a be quiet! Dark Power 14 1200W (ATX 3.1 compliant, Titanium rated, well above NVIDIA’s 1000W recommendation).

  • Overclocking Key (OCK) set to ON (Single-Rail Mode for full 1200W on one rail).

  • Using the native 12V-2x6 cable (no adapters).

  • Reseated the cable on both GPU and PSU ends.

  • Tested both PCIe 5.0 ports on the PSU.

  3. Driver/Software:
  • Running the latest production driver (580.95.05).

  • Clean driver installation.

  • No overclocking or custom fan curves applied.

  4. Workaround Attempt:
  • Currently testing with a GPU clock lock at 2100MHz to limit power draw, though this isn’t really a viable long-term solution for a workstation-class card.
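
For reference, the power limit and clock lock from the steps above can be scripted so they are applied the same way before every test run. This is only a sketch: the function names and the dry-run default are my own choices, GPU index 0 is an assumption, and the commands normally need root to take effect.

```python
import shlex
import subprocess


def limit_commands(power_watts, clock_mhz, gpu_index=0):
    """Build nvidia-smi invocations for a power cap and a fixed GPU clock."""
    gpu = str(gpu_index)
    return [
        ["nvidia-smi", "-i", gpu, "-pl", str(power_watts)],
        ["nvidia-smi", "-i", gpu, f"--lock-gpu-clocks={clock_mhz},{clock_mhz}"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # usually requires root


# Dry run with the values from this post: 400W cap, 2100MHz lock.
run_all(limit_commands(400, 2100))
# nvidia-smi -i 0 -pl 400
# nvidia-smi -i 0 --lock-gpu-clocks=2100,2100
```

The clock lock can be undone afterwards with nvidia-smi --reset-gpu-clocks.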

My Assessment

Given that:

  • The crash occurs even at the GPU’s minimum 400W power limit.

  • The PSU (1200W, ATX 3.1, Titanium rated) significantly exceeds NVIDIA’s recommended 1000W.

  • The Xid 119 error points to a GSP communication timeout.

I’m wondering if this could be a hardware issue with this particular GPU unit, possibly related to the GSP or power delivery circuitry. I’d really appreciate any thoughts on this.


Questions for the Community / NVIDIA Engineers

  1. Is the Xid 119 GSP timeout typically indicative of a hardware issue, or are there any software/driver workarounds I might try?

  2. Are there any additional diagnostic commands or logs I should capture during a crash attempt that would be helpful?

  3. Has anyone else experienced similar issues with the RTX 5090 FE under sustained CUDA workloads on Linux?

If the conclusion is that this is likely a hardware defect, I’m happy to proceed with the RMA process as NVIDIA Customer Care has indicated.


Additional Information Available

I’m happy to provide any of the following if it would help:

  • Full dmesg output

  • journalctl -b -1 (previous boot before crash)

  • GPU monitoring logs during load

  • Any other logs or data needed
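
For the GPU monitoring logs, one possible approach is to sample nvidia-smi in a loop and force every sample to disk, so the last readings before a hard power-off are more likely to survive. The Python sketch below assumes the standard query fields timestamp, power.draw, clocks.sm and temperature.gpu. The function names, the output path and the sampling interval are placeholders of mine, and the sample line in the demo is made up for illustration, not a real measurement.

```python
import os
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,power.draw,clocks.sm,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_sample(line):
    """Split one CSV line from QUERY into typed fields."""
    timestamp, power, clock, temp = [field.strip() for field in line.split(",")]
    return {
        "timestamp": timestamp,
        "power_w": float(power),
        "sm_clock_mhz": int(clock),
        "temp_c": int(temp),
    }


def log_samples(path, interval_s=0.5):
    """Append one sample per interval to path until interrupted (Ctrl+C)."""
    with open(path, "a") as log_file:
        while True:
            result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
            log_file.write(result.stdout)
            log_file.flush()
            os.fsync(log_file.fileno())  # push the sample to disk before the next one
            time.sleep(interval_s)


# Made-up sample line, only to show the parsed structure.
print(parse_sample("2025/01/01 00:00:00.000, 400.00, 2100, 70"))
```

To record during a crash attempt, call log_samples("gpu_load.csv") in a second terminal before starting the workload. A polling loop like this is far too slow to catch short power spikes, so a flat-looking log does not rule out a power delivery problem.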


This is my primary workstation, purchased specifically for AI development work, so I’m very keen to get it working reliably. I really appreciate any help or guidance you can offer!

Thank you so much for your time.

Exact same problem here. Also using a be quiet! Dark Power 14 1200W, with an ASUS ROG Astral LC 5090 OC. Reproducible 100% with ComfyUI-FluxTrainer (based on Kohya_ss) and sometimes with FLUX.1 inference under heavy load.

Did you find anything?

Hi,

I am experiencing the same issue running NVIDIA DeepStream pipelines, though it happens intermittently.

Before January 8th, everything was running perfectly with 0 crashes. After an Ubuntu update, I started experiencing random crashes whenever I launch DeepStream pipelines, even with default configurations.

I reduced the power limit to 400W, but it still crashes randomly. Often, after a crash, when I reboot, the motherboard beeps indicating that no GPU is detected.

System Specs:
GPU: MSI RTX 5090
CPU: Intel Xeon w3-2435
RAM: Samsung M321R2GA3BB6-CQKET 2x16GB
MOTHERBOARD: HP Z4 G5 Workstation Desktop PC
PSU: 1125W

Original Environment:
Driver: 580.95.05
CUDA: 12.8 (for DeepStream)
TensorRT: 10.9.0.34 compiled for CUDA 12.8
DeepStream: 8.0

The crashes started after I installed the following updates on Ubuntu 24.04.3 LTS:

Start-Date: 2026-01-08 06:14:13
Commandline: /usr/bin/unattended-upgrade
Upgrade: libglib2.0-dev-bin:amd64 (2.80.0-6ubuntu3.5, 2.80.0-6ubuntu3.6), libglib2.0-bin:amd64 (2.80.0-6ubuntu3.5, 2.80.0-6ubuntu3.6), libglib2.0-dev:amd64 (2.80.0-6ubuntu3.5, 2.80.0-6ubuntu3.6), gir1.2-glib-2.0:amd64 (2.80.0-6ubuntu3.5, 2.80.0-6ubuntu3.6), libglib2.0-data:amd64 (2.80.0-6ubuntu3.5, 2.80.0-6ubuntu3.6), libgirepository-2.0-0:amd64 (2.80.0-6ubuntu3.5, 2.80.0-6ubuntu3.6), gir1.2-glib-2.0-dev:amd64 (2.80.0-6ubuntu3.5, 2.80.0-6ubuntu3.6), libglib2.0-0t64:amd64 (2.80.0-6ubuntu3.5, 2.80.0-6ubuntu3.6), libglib2.0-0t64:i386 (2.80.0-6ubuntu3.5, 2.80.0-6ubuntu3.6)
End-Date: 2026-01-08 06:14:17

Start-Date: 2026-01-08 08:33:11
Commandline: aptdaemon role='role-commit-packages' sender=':1.9332221'
Upgrade: libblkid-dev:amd64 (2.39.3-9ubuntu6.3, 2.39.3-9ubuntu6.4), libxnvctrl0:amd64 (590.44.01-0ubuntu1, 590.48.01-0ubuntu1), netplan-generator:amd64 (1.1.2-2~ubuntu24.04.2, 1.1.2-8ubuntu1~24.04.1), libsmartcols1:amd64 (2.39.3-9ubuntu6.3, 2.39.3-9ubuntu6.4), udev:amd64 (255.4-1ubuntu8.11, 255.4-1ubuntu8.12), systemd-oomd:amd64 (255.4-1ubuntu8.11, 255.4-1ubuntu8.12), python3.13:amd64 (3.13.10-1+noble1, 3.13.11-1+noble1), dhcpcd-base:amd64 (1:10.0.6-1ubuntu3.1, 1:10.0.6-1ubuntu3.2), libmbim-utils:amd64 (1.31.2-0ubuntu3, 1.31.2-0ubuntu3.1), mutter-common-bin:amd64 (46.2-1ubuntu0.24.04.12, 46.2-1ubuntu0.24.04.13), google-chrome-stable:amd64 (143.0.7499.40-1, 143.0.7499.192-1), libmount-dev:amd64 (2.39.3-9ubuntu6.3, 2.39.3-9ubuntu6.4), systemd-timesyncd:amd64 (255.4-1ubuntu8.11, 255.4-1ubuntu8.12), libpipewire-0.3-common:amd64 (1.0.5-1ubuntu3.1, 1.0.5-1ubuntu3.2), libmbim-glib4:amd64 (1.31.2-0ubuntu3, 1.31.2-0ubuntu3.1), libpam-systemd:amd64 (255.4-1ubuntu8.11, 255.4-1ubuntu8.12), pipewire-pulse:amd64 (1.0.5-1ubuntu3.1, 1.0.5-1ubuntu3.2), libgdm1:amd64 (46.2-1ubuntu1~24.04.4, 46.2-1ubuntu1~24.04.5), python3-netplan:amd64 (1.1.2-2~ubuntu24.04.2, 1.1.2-8ubuntu1~24.04.1), libpython3.13-stdlib:amd64 (3.13.10-1+noble1, 3.13.11-1+noble1), libmutter-14-0:amd64 (46.2-1ubuntu0.24.04.12, 46.2-1ubuntu0.24.04.13), libsystemd0:amd64 (255.4-1ubuntu8.11, 255.4-1ubuntu8.12), libsystemd0:i386 (255.4-1ubuntu8.11, 255.4-1ubuntu8.12), libmount1:amd64 (2.39.3-9ubuntu6.3, 2.39.3-9ubuntu6.4), libmount1:i386 (2.39.3-9ubuntu6.3, 2.39.3-9ubuntu6.4), libnss-systemd:amd64 (255.4-1ubuntu8.11, 255.4-1ubuntu8.12), libudev-dev:amd64 (255.4-1ubuntu8.11, 255.4-1ubuntu8.12), pipewire:amd64 (1.0.5-1ubuntu3.1, 1.0.5-1ubuntu3.2), mutter-common:amd64 (46.2-1ubuntu0.24.04.12, 46.2-1ubuntu0.24.04.13), util-linux:amd64 (2.39.3-9ubuntu6.3, 2.39.3-9ubuntu6.4), gnome-shell:amd64 (46.0-0ubuntu6~24.04.11, 46.0-0ubuntu6~24.04.12), systemd:amd64 (255.4-1ubuntu8.11, 255.4-1ubuntu8.12), libudev1:amd64 
(255.4-1ubuntu8.11, 255.4-1ubuntu8.12), libudev1:i386 (255.4-1ubuntu8.11, 255.4-1ubuntu8.12), fdisk:amd64 (2.39.3-9ubuntu6.3, 2.39.3-9ubuntu6.4), gnome-settings-daemon-common:amd64 (46.0-1ubuntu1, 46.0-1ubuntu1.24.04.1), python3.13-venv:amd64 (3.13.10-1+noble1, 3.13.11-1+noble1), libfdisk1:amd64 (2.39.3-9ubuntu6.3, 2.39.3-9ubuntu6.4), systemd-dev:amd64 (255.4-1ubuntu8.11, 255.4-1ubuntu8.12), eject:amd64 (2.39.3-9ubuntu6.3, 2.39.3-9ubuntu6.4), gdm3:amd64 (46.2-1ubuntu1~24.04.4, 46.2-1ubuntu1~24.04.5), gnome-shell-extension-desktop-icons-ng:amd64 (46+really47.0.9-1ubuntu4, 46+really47.0.9-1ubuntu5), gnome-settings-daemon:amd64 (46.0-1ubuntu1, 46.0-1ubuntu1.24.04.1), libspa-0.2-bluetooth:amd64 (1.0.5-1ubuntu3.1, 1.0.5-1ubuntu3.2), libuuid1:amd64 (2.39.3-9ubuntu6.3, 2.39.3-9ubuntu6.4), clickhouse-client:amd64 (25.11.2.24, 25.12.2.54), uuid-runtime:amd64 (2.39.3-9ubuntu6.3, 2.39.3-9ubuntu6.4), systemd-resolved:amd64 (255.4-1ubuntu8.11, 255.4-1ubuntu8.12), gir1.2-mutter-14:amd64 (46.2-1ubuntu0.24.04.12, 46.2-1ubuntu0.24.04.13), libmbim-proxy:amd64 (1.31.2-0ubuntu3, 1.31.2-0ubuntu3.1), gstreamer1.0-pipewire:amd64 (1.0.5-1ubuntu3.1, 1.0.5-1ubuntu3.2), uuid-dev:amd64 (2.39.3-9ubuntu6.3, 2.39.3-9ubuntu6.4), pipewire-audio:amd64 (1.0.5-1ubuntu3.1, 1.0.5-1ubuntu3.2), pipewire-bin:amd64 (1.0.5-1ubuntu3.1, 1.0.5-1ubuntu3.2), gnome-shell-common:amd64 (46.0-0ubuntu6~24.04.11, 46.0-0ubuntu6~24.04.12), nvidia-settings:amd64 (590.44.01-0ubuntu1, 590.48.01-0ubuntu1), gir1.2-gdm-1.0:amd64 (46.2-1ubuntu1~24.04.4, 46.2-1ubuntu1~24.04.5), rfkill:amd64 (2.39.3-9ubuntu6.3, 2.39.3-9ubuntu6.4), mount:amd64 (2.39.3-9ubuntu6.3, 2.39.3-9ubuntu6.4), clickhouse-common-static:amd64 (25.11.2.24, 25.12.2.54), libspa-0.2-modules:amd64 (1.0.5-1ubuntu3.1, 1.0.5-1ubuntu3.2), libwhoopsie0:amd64 (0.2.77build3, 0.2.77ubuntu0.1), libsystemd-shared:amd64 (255.4-1ubuntu8.11, 255.4-1ubuntu8.12), netplan.io:amd64 (1.1.2-2~ubuntu24.04.2, 1.1.2-8ubuntu1~24.04.1), libpipewire-0.3-0t64:amd64 (1.0.5-1ubuntu3.1, 
1.0.5-1ubuntu3.2), clickhouse-server:amd64 (25.11.2.24, 25.12.2.54), systemd-sysv:amd64 (255.4-1ubuntu8.11, 255.4-1ubuntu8.12), libblkid1:amd64 (2.39.3-9ubuntu6.3, 2.39.3-9ubuntu6.4), libblkid1:i386 (2.39.3-9ubuntu6.3, 2.39.3-9ubuntu6.4), whoopsie:amd64 (0.2.77build3, 0.2.77ubuntu0.1), libpipewire-0.3-modules:amd64 (1.0.5-1ubuntu3.1, 1.0.5-1ubuntu3.2), nvidia-firmware-580-580.95.05:amd64 (580.95.05-0ubuntu0.24.04.2, 580.95.05-0ubuntu0.24.04.3), bsdutils:amd64 (1:2.39.3-9ubuntu6.3, 1:2.39.3-9ubuntu6.4), libnetplan1:amd64 (1.1.2-2~ubuntu24.04.2, 1.1.2-8ubuntu1~24.04.1), bsdextrautils:amd64 (2.39.3-9ubuntu6.3, 2.39.3-9ubuntu6.4), pipewire-alsa:amd64 (1.0.5-1ubuntu3.1, 1.0.5-1ubuntu3.2)
End-Date: 2026-01-08 08:33:54

Start-Date: 2026-01-08 08:34:27
Commandline: aptdaemon role='role-commit-packages' sender=':1.9332221'
Upgrade: linux-firmware:amd64 (20240318.git3b128b60-0ubuntu2.21, 20240318.git3b128b60-0ubuntu2.22)
End-Date: 2026-01-08 08:34:34

What I have tried:

  • Limiting power to 400W.
  • Upgrading to the 590 driver branch.
  • Downgrading to the original driver with a clean installation (580.95.05).
  • Downgrading to driver 570.211.01 (clean installation, headless/no visual environment).
  • Setting GPU clocks: --lock-gpu-clocks=2100,2100 before running the pipeline.

Current Situation:
Yesterday, I had one crash at the start of the day. After that, I ran my pipeline about 30 times with no errors. I didn’t shut down the PC, but today I got another crash (black screen and GPU not detected upon reboot).

Here are the logs I get when it crashes:

ene 22 08:49:39 goia-pc-HP-Z4-G5-Workstation-Desktop-PC kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000ca7e:6:0:0x0000000f
ene 22 08:49:39 goia-pc-HP-Z4-G5-Workstation-Desktop-PC kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000ca7e:4:0:0x0000000f
ene 22 08:49:40 goia-pc-HP-Z4-G5-Workstation-Desktop-PC kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000ca7e:0:0:0x0000000f
ene 22 08:49:40 goia-pc-HP-Z4-G5-Workstation-Desktop-PC kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000ca7e:2:0:0x0000000f
ene 22 08:49:40 goia-pc-HP-Z4-G5-Workstation-Desktop-PC kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000ca7e:4:0:0x0000000f
ene 22 08:49:40 goia-pc-HP-Z4-G5-Workstation-Desktop-PC kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000ca7e:6:0:0x0000000f
-- Boot 68d79d5335ac4537bb3da60040c158ae --
ene 22 09:01:34 goia-pc-HP-Z4-G5-Workstation-Desktop-PC kernel:
ene 22 09:01:56 goia-pc-HP-Z4-G5-Workstation-Desktop-PC gdm-password][5034]: gkr-pam: unable to locate daemon control file
ene 22 09:01:57 goia-pc-HP-Z4-G5-Workstation-Desktop-PC gdm3[1669]: Gdm: on_display_added: assertion ‘GDM_IS_REMOTE_DISPLAY (display)’ failed
ene 22 09:01:59 goia-pc-HP-Z4-G5-Workstation-Desktop-PC systemd[5077]: Failed to start app-gnome-gnome\x2dkeyring\x2dpkcs11-5820.scope - Application launched by gnome-sessio>
ene 22 09:01:59 goia-pc-HP-Z4-G5-Workstation-Desktop-PC systemd[5077]: Failed to start app-gnome-xdg\x2duser\x2ddirs-5837.scope - Application launched by gnome-session-binar>
ene 22 09:02:01 goia-pc-HP-Z4-G5-Workstation-Desktop-PC gdm3[1669]: Gdm: on_display_removed: assertion ‘GDM_IS_REMOTE_DISPLAY (display)’ failed
-- Boot d517f0bf88de4c8eb6f911200ceddc5d --
ene 22 09:16:59 goia-pc-HP-Z4-G5-Workstation-Desktop-PC kernel:
ene 22 09:17:14 goia-pc-HP-Z4-G5-Workstation-Desktop-PC gdm-password][4749]: gkr-pam: unable to locate daemon control file
ene 22 09:17:15 goia-pc-HP-Z4-G5-Workstation-Desktop-PC gdm3[1715]: Gdm: on_display_added: assertion ‘GDM_IS_REMOTE_DISPLAY (display)’ failed
ene 22 09:17:17 goia-pc-HP-Z4-G5-Workstation-Desktop-PC systemd[4789]: Failed to start app-gnome-gnome\x2dkeyring\x2dpkcs11-5565.scope - Application launched by gnome-sessio>
ene 22 09:17:19 goia-pc-HP-Z4-G5-Workstation-Desktop-PC systemd[4789]: Failed to start app-gnome-user\x2ddirs\x2dupdate\x2dgtk-6000.scope - Application launched by gnome-ses>
ene 22 09:17:20 goia-pc-HP-Z4-G5-Workstation-Desktop-PC gdm3[1715]: Gdm: on_display_removed: assertion ‘GDM_IS_REMOTE_DISPLAY (display)’ failed

Get yourself a decent PSU. That’s what I did, e.g. a Seasonic Prime TX-1600.

System running stable now.

I moved the PC from the power strip to a wall outlet (leaving only the monitors and chargers there), and it hasn’t crashed since. Anyways, I should buy a decent PSU as you said.
Thanks!

Ok, I also use the be quiet! Dark Power 14 1200W and I am having the same issue. For me it happens when starting vLLM or using WAN 2.2. I was monitoring power draw and it looks fine. Can’t imagine what is happening. I get nothing in the system logs. Where did you get “Xid 119 - GSP Timeout”?

Ok, there are many reports on Reddit of issues using be quiet! PSUs with a 5090. The solution is to turn the “OCK” switch on the back of the PSU to the “on” position. This was confirmed to me by be quiet! tech support as well. Even though this seems to work for most users, it did not solve the issue for me. I am sending the PSU back and purchased an ASUS ROG Strix 1200W, which seems to run stable and thus solved the issue for me.

Hello, has the problem been resolved yet?

It was the PSU - the Dark Power 14 1200W wasn’t strong enough for the 5090. Bought a Corsair and haven’t had issues since!