Bug Report - 'GPU has fallen off the bus' randomly; NVIDIA GeForce RTX 4090 + NVIDIA GeForce RTX 5090 D dual setup

Hi,

My setup is Ubuntu 24.04.4 LTS with Linux kernel 6.17.0-14-generic, an NVIDIA GeForce RTX 4090 driving a monitor over DisplayPort, and an NVIDIA GeForce RTX 5090 D used for compute only. There is no bridge (data cable) between them, as the computations run independently.

For over 6 months the system was stable and ran more or less 24/7 with CUDA C computations without any problems. Then, 2 weeks ago, the system started crashing up to 3 times a day, followed by a period of 12 days with no crashes and then 3 crashes in one day. The kernel reports 'GPU has fallen off the bus', which results in a black screen under Xorg (2:21.1.12-1ubuntu1.5 amd64). The system keeps running and I can ssh in and run nvidia-bug-report.sh, which hangs; it suggests the arguments --safe-mode and --extra-system-data, which I appended.

The crashes also happen in the idle state (P8), which makes a power issue less likely?

The problem happens with both Linux kernel 6.17 and 6.14, and with NVIDIA driver nvidia-driver-580-open 580.126.09-0ubuntu0.24.04.1 amd64 as well as nvidia-driver-590-open 590.48.01-0ubuntu0.24.04.1 amd64.
I also tried turning off Resizable BAR in the BIOS, but that made no difference.
I tried different combinations of HDMI and DP cables, and output from the other card, but the system still crashes.

When the screen output is from the 5090, the system hangs and I cannot run the bug report script.

Finally, it is presumably always the 4090 that crashes: its fan goes to max speed, Xorg is busy at 100% CPU, and nvtop still sees the RTX 5090, although the kernel crash also kills the computations on the 5090.

The machine specs are:
NVIDIA GeForce RTX 4090 (DP monitor output)
NVIDIA GeForce RTX 5090 D (no output)
Mainboard: ASUS PRIME Z790-P
CPU: Intel® Core™ i7-14700KF × 28
Memory: 64.0 GiB Asgaard at both 5500 and 4200 MHz (no difference)
Xorg 2:21.1.12-1
64-bit Ubuntu 24.04.4 LTS

Can I determine whether this is a power issue, a driver issue, or something else?

I attached the bug report, which has more info.

nvidia-bug-report.log.gz (224.3 KB)

nvidia-bug-report.log.old.gz (153.6 KB)

Any help is highly appreciated!

Sincerely,
/sbgudnason

I pulled your bug report and went through the logs. Here’s what they show.

Which GPU is crashing

The Xid 79 events are on PCI:0000:07:00, which maps to your RTX 4090 (AD102, MSI). Three Xid 79 crashes on March 1 — at 13:52, 14:55, and 19:49 — all originating on 07:00. The 5090 D (PCI:0000:01:00) gets a secondary Xid 154 each time, which is the driver flagging both GPUs for recovery after the primary failure. That’s consistent with what you observed — 4090 fans to max, Xorg pinned, but nvtop still sees the 5090.
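For reference, here is a quick way to pull the Xid timeline yourself from the kernel journal (journalctl is standard on Ubuntu; "NVRM: Xid" is the prefix the NVIDIA kernel driver uses for these events, and the PCI address in each line identifies the GPU):

```shell
# List NVIDIA Xid events with timestamps from the kernel journal.
# If the journal has no matching lines, say so instead of printing nothing.
journalctl -k --no-pager 2>/dev/null | grep -i 'NVRM: Xid' \
    || echo "no Xid events found in the current journal"
```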

GSP RPC failure

Right before each Xid 79, the driver logged 495 consecutive rpcSendMessage failed with status 0x0000000f for the 4090 (GPU1 in driver ordering). The GSP on the 4090 stopped responding — the driver was trying to send diagnostic dump requests (function 78, DUMP_PROTOBUF_COMPONENT) and every one failed with NV_ERR_GPU_IS_LOST. The GSP sanity check shows it was stuck on a GSP_RM_CONTROL call (function 76, sequence 2614630) that never returned. The RPC history shows normal completion times (114–1691 µs) for prior calls, then nothing.

By the second crash at 14:55 the driver couldn’t even allocate memory for protocol buffers (prbEncStartAlloc: Can't allocate memory) — the system was already degraded from the earlier crash.

PCIe link state (post-crash)

The 4090 slot shows LnkCap Gen5 x16 but LnkSta Gen1 x16 in the bug report. This was captured after the crash, so Gen1 fallback is expected for a card that lost its bus — it doesn’t tell us the pre-crash link state. Your 5090 D is running at Gen4 x4, which is the electrical limit of its slot on the Z790-P.

Serial Number

The log shows GPU Board Serial Number: 0 associated with PCI:0000:01:00 — that’s the 5090 D. A zero serial can mean the EEPROM wasn’t programmed or can’t be read. This doesn’t directly cause the 4090’s crash, but it’s worth noting for completeness.

What this looks like

The 4090 is losing its PCIe link at idle. The fact that it happens in P8 state, across two driver versions (580.126.09, 590.48.01), across two kernels (6.14, 6.17), and with ReSize BAR on or off makes a pure driver regression less likely, though it can’t be fully ruled out. Six months of stability followed by sudden repeated failures is a pattern more consistent with a developing hardware or link-level issue.

A few things that could help isolate it:

  1. After a clean boot (before any crash), check whether the 4090 trained to full speed:
sudo lspci -vvv -s 07:00.0 | grep -i "lnksta"

If it shows Gen4 or Gen5, the link trains fine and something causes it to drop later. If it shows Gen1 on a fresh boot, the link is marginal from the start.

  2. If you can temporarily run with just the 4090 (5090 D physically removed), does the crash still occur? That isolates whether the 5090 D's presence is contributing: shared PCIe root complex resources, power draw, or driver cross-architecture interaction.

  3. Have you tried a different PCIe power cable run to the 4090? If the PSU has separate cables, try switching which one feeds the 4090. What PSU are you running? A 4090 + 5090 D together can pull 900W+ from the 12V rail under transient spikes.

  4. Do you know what driver version you were running during the 6 months of stability? If an automatic update pushed a newer driver around when the crashes started, that would help narrow the timeline:

apt list --installed 2>/dev/null | grep nvidia-driver
journalctl --since "2026-01-15" | grep -i "nvidia.*install\|apt.*nvidia"

  5. If the link looks healthy on a fresh boot, a sustained PCIe stress test can help catch intermittent link degradation before it escalates to a full Xid 79. This tool pushes DMA traffic and monitors throughput, replay counters, and correctable errors in real time:
./gpu_pcie_path_validator --gpu 0000:07:00.0 --size-mib 4096 --window-ms 60000 --interval-ms 50

If the link is marginal, you'll see rising replay counters or throughput dropping below 50% of rated before the crash. The Usage guide covers the full set of run modes: gpu-pcie-path-validator/docs/Usage.md at main · parallelArchitect/gpu-pcie-path-validator · GitHub
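As a supplement to step 1, the LnkSta check can also be left running as a timestamped logger, so that a later downgrade can be correlated with the crash time. This is a minimal sketch; the BDF, log path, and interval are assumptions to adjust for your system (COUNT=1 takes a single sample as a demo; set COUNT=0 to log until stopped):

```shell
#!/bin/sh
# Periodically record the 4090's negotiated PCIe link state so a
# Gen/width downgrade can be timestamped against the next crash.
BDF="${BDF:-07:00.0}"          # PCI address of the GPU to watch (assumed from the thread)
LOG="${LOG:-./linksta.log}"    # where samples are appended
INTERVAL="${INTERVAL:-60}"     # seconds between samples
COUNT="${COUNT:-1}"            # number of samples; 0 = run until killed

i=0
while :; do
    printf '%s %s\n' "$(date -Is)" \
        "$(lspci -vvv -s "$BDF" 2>/dev/null | grep -i 'LnkSta:')" >> "$LOG"
    i=$((i + 1))
    [ "$COUNT" -gt 0 ] && [ "$i" -ge "$COUNT" ] && break
    sleep "$INTERVAL"
done
echo "wrote $i sample(s) to $LOG"
```

Run it under nohup or a tmux session and check the tail of the log after the next black screen.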

Dear Joe (parallelArchitect) from NVIDIA,

Thanks so much for the invaluable help and excellent analysis.
I may comment that I thought the 5090 D was running in the Gen5 x16 PCIe slot, whereas the 4090 has always been in the Gen4 x4 slot; I'm not sure if/why the logs show otherwise.
I’m not sure why the serial number for the 5090 D is 0, but thanks for the info.

Since the reports were submitted, I have been running the 5090 D in its PCI slot 01:00.0 (same as in the logs) for 19 days on P1 load (about 235W usage on average) without any crashes on the 590.48.01 driver and kernel 6.17.0-14.

I will try to answer all the questions:

  1. $ sudo lspci -vvv -s 07:00.0 | grep -i "lnksta"
    LnkSta: Speed 2.5GT/s (downgraded), Width x4 (downgraded)
    LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+

I don't see Gen4 or Gen5 here (nor in the lines below the grep'ed ones); the x4 width is consistent with what nvtop shows, whereas the 5090 D has this output (x16 and the width not downgraded):

$ sudo lspci -vvv -s 01:00.0 | grep -i "lnksta"
LnkSta: Speed 2.5GT/s (downgraded), Width x16
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+

I do believe that both GPUs can run at Gen4 or Gen5 after a clean boot (checking with, for instance, nvtop or nvidia-smi).

  2. I'm trying this from now on for a while; so far no issues (1 day and counting).

  3. I will try this too after the single-GPU 4090 test with the original cable; there is another available cable from the PSU.
    My PSU is a Hangjia IP2050G gold server edition with a 2050W cap, which should be sufficient.

  4. Here are the outputs:

$ apt list --installed 2>/dev/null | grep nvidia-driver
nvidia-driver-590-open/noble-updates,noble-security,now 590.48.01-0ubuntu0.24.04.4 amd64 [installed]
$ journalctl --since "2026-01-15" | grep -i "nvidia.*install\|apt.*nvidia"
Feb 16 13:09:53 betaprime sudo[11581]: bjarke : TTY=pts/0 ; PWD=/var/log/apt ; USER=root ; COMMAND=/usr/bin/apt install linux-image-6.8.0-1045-nvidia
Feb 28 14:37:27 betaprime sudo[5232]: bjarke : TTY=pts/0 ; PWD=/home/bjarke ; USER=root ; COMMAND=/usr/bin/apt install nvidia-driver-590-open

I did think of this myself, and I noticed that I had updated the Linux kernel just before the crashes started happening (same NVIDIA driver, though); see the attached apt_history.log.

Because of this, I tried booting 6.14, but the crashes continued, so I couldn't really conclude that it was kernel 6.17.

Indeed, the Feb 16 entry in the journalctl output above is me installing additional kernels to see if they also had the issue.
On Feb 28, I tried updating to the 590 NVIDIA driver to see if that fixed something.
Neither attempt solved the problem.

  5. I've run this tool and attached the report (report.txt; .json is apparently not supported by the NVIDIA forum). The first part is for the 4090 and does not show an increase in PCIe replay counters (all zero).
    It does say State: LINK_DEGRADED, Link consistency: FALSE, but Replay counter increase: NONE.
    You are probably much better at interpreting this output.

Let me know if you need any further data.

Thanks a lot again for the help!

best regards,
/sbgudnason

(attachments)

report.txt (9.87 KB)

Hi sbgudnason,

Quick clarification — I’m not from NVIDIA. I’m a community contributor who builds open-source GPU diagnostic tooling. Glad the analysis and tools have been useful.

Your PCIe validator report shows the topology clearly.

RTX 4090 at 07:00.0 (via root port 00:1d.0):

  • Negotiated pre-load: Gen1 x4
  • Negotiated post-load: Gen4 x4
  • Speed recovered to Gen4 under load, width stayed at x4
  • Throughput: 6.08 GB/s (~19% of theoretical) — no replay/AER errors observed

RTX 5090 D at 01:00.0 (via root port 00:01.0):

  • Negotiated pre-load: Gen1 x16
  • Negotiated post-load: Gen5 x16
  • Throughput: 22.03 GB/s (~35% of theoretical) — no replay/AER errors observed

The 4090 is running at x4 width. The root port 00:1d.0 is PCH-connected on the Z790, which is typically x4 electrical. You confirmed the 4090 has always been in this slot and was stable there for 6 months, so x4 is the baseline rather than the trigger. In that sense, the reduced width is better treated as reduced operating margin rather than a direct cause.

The apt history log shows the system upgraded from kernel 6.14.0-37 to 6.17.0-14 on Feb 6. Crashes started around Feb 22. The 5090 D has been stable for 19 days solo on the same driver (590.48.01). The crashes therefore correlate with the 4090 operating behind the PCH x4 root port after the kernel change.

Since the crashes occur at idle (P8), the PCIe link power-state transition path is a plausible area to investigate — the link may enter lower-power states at idle (e.g. ASPM), and failures during the transition back to an active state could lead to device loss.

So the working chain is:

idle / low-power state
→ link/device transition back to active
→ GPU becomes inaccessible (Xid 79)
→ recovery escalation (Xid 154)

To narrow this further:

  1. Swap the 4090 into the CPU x16 slot (where the 5090 D sits). If the crashes stop, it points to the PCH x4 slot/path. If the crashes persist on the 4090 in the CPU slot, it points to a 4090-specific issue (card or driver path).

  2. Test with the kernel parameter "pcie_aspm=off" to rule out link power management effects.

  3. If you can locate and boot the exact 6.14.0-37 kernel, that is the cleanest regression test.
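For step 2, a sketch of the usual Ubuntu way to add that parameter (editing /etc/default/grub is the standard mechanism; the GRUB_CMDLINE_LINUX_DEFAULT contents shown are only an example, keep whatever options are already there):

```shell
# 1. Add pcie_aspm=off to the kernel command line in /etc/default/grub:
#      GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_aspm=off"
# 2. Regenerate the grub config and reboot:
sudo update-grub
sudo reboot
# 3. After reboot, confirm the parameter is active:
grep -o 'pcie_aspm=off' /proc/cmdline
# 4. Optionally inspect which ASPM states are enabled on the 4090's link:
sudo lspci -vvv -s 07:00.0 | grep -i 'ASPM'
```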

Dear Joe/parallelArchitect,

Thanks for the message!
I've been running tests with the 4090 alone (no 5090 D) in the x4 slot (07:00.0) without crashes since the last message.
Now I’ve switched the 4090 to the x16 Gen5 PCIe slot and will do testing there with the 6.14.0-37 kernel and pcie_aspm=off kernel parameter.

Upon removing the 4090, I noticed that the mechanical support had been slightly depressed into the grid; the plastic material is not as sturdy as the person who installed the GPU probably thought (the original metal support that comes with the 4090 is too large to fit into the cabinet due to the PSU/hard-disk cage).

I'm afraid the crashes could be due to the GPU physically not being well connected to the bus; I didn't suspect this since the GPU always came back online on reboot, leading me to think it was a driver or kernel problem.
I'll continue testing, but may close this bug if no further crashes occur.

I'm sorry if I've wasted your time reading the bug reports.
But I really appreciate all the help and found your GPU diagnostic tool very helpful in testing the possibilities.

Best regards,
/sbgudnason

Glad it helped — the quality of the data you provided made it possible to narrow this down properly. Having the logs and validator output made a big difference in understanding what the system was actually doing.

Your follow-up testing (isolating the 4090 and checking different slot/path behavior) is exactly the right direction. The mechanical support observation also lines up with the kind of intermittent PCIe issues that can lead to this pattern.

At this point it should become clear whether the issue is tied to the slot/path or the card itself.

If anything new shows up in the logs during testing, feel free to share — happy to take another look.

Thanks a lot!
I’ll keep you updated if and when further crashes happen.
If it turns out to be stable now, then I’ll close this bug.
Thanks again for the help with the analysis!!!
/sbgudnason
