Keep getting "GPU has fallen off the bus" with 3090 cards on Gigabyte MZ32-AR1 Rev 3.0 motherboard

I have an EPYC 7763 CPU in a Gigabyte MZ32-AR1 (rev. 3.0) motherboard with 1 TB of 3200 MHz RAM. The motherboard is new and has no issues, except that the Nvidia cards periodically fall off the bus.

What I have tried so far:

Removing all PCI-E devices except the Nvidia cards - did not help. However, I noticed the Nvidia driver can cause other PCI-E devices to malfunction when a GPU falls off the bus, if they are present. Without the Nvidia cards there are no stability issues.

I use a server-grade IBM PSU rated at 2880 W to power the four 3090 GPUs, but the issue usually happens when they are idle, so I am confident power is not the cause.

Switching to PCI-E 3.0 or even PCI-E 2.0 - did not help either, which suggests signal integrity is not the problem. Enabling PCI-E Advanced Error Reporting (AER) in the BIOS confirmed there are no errors.
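For anyone wanting to check the same thing: AER errors, when they occur, normally show up in the kernel log, so they can be looked for with something like the following (just a sketch; the grep pattern is only illustrative):

sudo dmesg | grep -iE 'AER|pcieport.*error'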

Various kernel flags that did NOT help (a sketch of how they were applied follows the list):

amd_iommu=on
kvm.ignore_msrs=1
iommu=pt pcie_aspm=off
rcutree.rcu_idle_gp_delay=1
pci=realloc=off
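For reference, I applied these through the kernel command line in /etc/default/grub, followed by sudo update-grub and a reboot - roughly like this (the exact combination of flags varied between tests):

GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on iommu=pt pcie_aspm=off kvm.ignore_msrs=1 rcutree.rcu_idle_gp_delay=1 pci=realloc=off"
sudo update-grub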

With these I kept getting “GPU has fallen off the bus” every 1-2 days, sometimes less than an hour after boot - quite random.

Disabling the GPU firmware (GSP) with nvidia.NVreg_EnableGpuFirmware=0 (without the other kernel options) and then running the system for a while in Performance mode:

# Set Performance mode (GpuPowerMizerMode=1) on every detected GPU
GPU_COUNT=$(nvidia-smi --list-gpus | wc -l)
for i in $(seq 0 $((GPU_COUNT - 1))); do
  nvidia-settings -a "[gpu:${i}]/GpuPowerMizerMode=1"
done

…then switching back to Adaptive mode after a few days, resulted in 16 days of uptime, at which point I powered down normally to upgrade my M.2 SSD. During that run I also applied the suspend/resume workaround to bring down power in idle or partial-load states (Reddit - The heart of the internet) - another Nvidia driver bug causes each card to consume 10-15 W more than it should at idle unless the suspend/resume workaround is used.

I thought that maybe nvidia.NVreg_EnableGpuFirmware=0 helped, since 16 days of uptime was far greater than the 1-3 day uptimes I was getting before when encountering the issue. But after I booted back up (this time using Adaptive mode from the start, without the suspend/resume trick, to check whether nvidia.NVreg_EnableGpuFirmware=0 alone made the difference), the GPUs fell off the bus again in less than 3 days. I am attaching the debug log. I had to run it with --safe-mode, otherwise it hangs forever.

I have been fighting this problem for over two months now, and so far it seems to be related to power management. I have switched back to Performance mode and will see whether I encounter the issue again. The problem is that in Performance mode my rig consumes almost 0.5 kW extra at idle on the GPUs alone, which is a lot; in Adaptive mode the four GPUs usually consume less than 100 W at idle.

Also, the GPUs came from my previous rig, which was based on a gaming motherboard and never had this issue, and they are connected using the same PCI-E 4.0 cables.

I think this is an Nvidia driver bug, since power modes and power-saving features clearly have an effect on it, while everything else - including switching to slower PCI-E 3.0 or 2.0, or removing other PCI-E devices - has no effect at all. That said, since it takes a long time to encounter the issue, it is hard to say what helped and what did not, but I hope this is enough to look into it. If anyone has other ideas or suggestions for narrowing down the issue, I would greatly appreciate them.

nvidia-bug-report.log.gz (389.9 KB)


It’s probably a power supply issue rather than anything else.
Can you test with a different PSU by any chance?

I did - I have another power supply and tried connecting to it instead, with no difference. I also tried connecting fewer GPUs, again no difference. It is also worth mentioning that the IBM 2880 W PSU powering these cards worked in my previous rig without issues, so it is known to work well with them. An oscilloscope shows a practically perfect 12.3 V even under full load on all GPUs. The IBM 2880 W PSU also has many power cables, so I tried different ones too, just in case - again no difference. Which makes sense, especially since the failure happens when the idle load is about 100 W or less in Adaptive mode, which is practically nothing for a 2880 W PSU.

Another thing I did not mention: my workstation is powered from a professional 6 kW online UPS with sixteen 12 V batteries (all battery voltages are kept equalized during both charging and discharging), with a total capacity of 2304 Wh. So the power supply is as close to perfect as it gets in my case, using server-grade components.

The first time I thought something made a difference in stopping the GPUs from falling off the bus every 1-3 days (after many weeks of testing) was when I disabled the GPU firmware (nvidia.NVreg_EnableGpuFirmware=0) and forced Performance mode, then after a few days switched to Adaptive mode and tried suspend/resume at some point. After 16 days of uptime (the first time it ran that long without issues, until I shut it down myself for the SSD upgrade), I thought that maybe nvidia.NVreg_EnableGpuFirmware=0 had helped, but that turned out not to be the case - when rebooted with just nvidia.NVreg_EnableGpuFirmware=0 and no other tricks, the GPUs fell off the bus in less than 3 days.

Since they still fell off the bus with just the GPU firmware disabled, it must be that Performance mode and/or the suspend/resume trick made the difference (maybe it made Adaptive mode more stable?), possibly in combination with nvidia.NVreg_EnableGpuFirmware=0.

I wrote a script, gpulist, that shows all GPUs with their actual PCI-E link width and speed. This is unlike nvidia-settings, which seems to show the maximum PCI-E generation and maximum width (for example, if I plug a GPU into an x8 slot, nvidia-settings will still claim x16, while my script correctly shows x8 in that case).
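The script itself is nothing special - the same current-link information can be read from nvidia-smi query fields, roughly like this (a sketch, not my exact gpulist script):

nvidia-smi --query-gpu=name,uuid,pci.bus_id,pcie.link.gen.current,pcie.link.width.current --format=csv,noheader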

And I discovered something interesting: in Performance mode, all GPUs are always in Gen4 mode (all four are currently connected to x16 slots):

NVIDIA GeForce RTX 3090 Ampere GPU-480d6f8d-a91a-caea-7e42-12beebfc4fb3 00000000:01:00.0 Gen4 x16
NVIDIA GeForce RTX 3090 Ampere GPU-5c2fee09-0008-f94c-ec2d-64a9db828ddc 00000000:41:00.0 Gen4 x16
NVIDIA GeForce RTX 3090 Ampere GPU-27f3593e-6c7e-3ed2-a3a7-e2684bb8c08c 00000000:81:00.0 Gen4 x16
NVIDIA GeForce RTX 3090 Ampere GPU-46eeaee1-07e9-80ab-e82e-33e046d48ac0 00000000:C1:00.0 Gen4 x16

In Adaptive mode, my script shows Gen1 while they are idle. They still fall off the bus even if the maximum is set to Gen2 in the BIOS (currently set to Gen4), so maybe switching PCI-E generations triggers the bug in some cases, at least on this motherboard (not sure if this is the case, just a guess). This may be related to the other Nvidia driver bug where cards in Adaptive mode can get stuck consuming 20-30 W instead of 10-20 W at idle unless suspended/resumed, so maybe the combination of Gen switching and a bad power state on this server motherboard somehow triggers the bug.

Obviously, this is quite complicated - and that is after more than two months of checking and excluding other possible causes, including trying other cables, different driver versions, kernel parameters, etc. I have tried so many things that it is hard to list them all at this point. I am posting this in the hope that the log suggests what could be wrong here, and maybe helps fix the bug in the drivers.

Currently I am testing with nvidia.NVreg_EnableGpuFirmware=0 in Performance mode, without the suspend/resume workaround, but it may take days or more than a week before I know with high certainty whether it makes a difference (the extra 0.5 kW of idle load from the GPUs is a huge issue for me, but if Performance mode proves to be stable, it may help narrow down the possible cause).

I remember reading somewhere that Nvidia drivers may be less stable on some server motherboards and may have the “GPU has fallen off the bus” issue - I thought I had saved the link, but could not find it when writing my previous post, and still cannot find it, but I thought I would mention it. From what I remember, that link did not provide any workarounds or attempts to narrow down the issue, so it probably does not matter much. But if it is true, it would explain why I did not have this issue on my previous gaming motherboard, using the same cables, the same PSU, and the same video cards.

One more thing - in the beginning I suspected a defective motherboard (even though it was brand new), so I had it replaced with another one, and nothing changed. Besides, the motherboard has proved to be absolutely stable on its own with other PCI-E devices, which also points to an Nvidia driver bug as the most likely cause.

Since this is a workstation I use for work, not just personal projects, I have no choice but to look for a solution or at least a workaround, and I really hope Nvidia employees can take a look into this. If anyone from the community can make additional suggestions about what else to try or what additional debug information to provide, I would appreciate that as well.

This did not take long - the GPUs fell off the bus again. So Performance mode on all cards did not help: in less than 9 hours of uptime, the GPUs fell off the bus once more. After a reboot, in Adaptive mode, I do not see higher-than-usual idle power consumption, so the suspend/resume workaround is not applicable, and it was not applicable in Performance mode either. I conclude that the issue at hand is not related to the other Nvidia driver bug with higher-than-usual idle power consumption.

Here, I attach the latest debug log:

nvidia-bug-report.log.gz (245.4 KB)

But I do not know what else to try this time.

I think I have exhausted all possibilities, and the bug seems to manifest completely at random - it may happen in just a few hours, or not happen for more than two weeks, to the point that I started to think it was solved. But apparently not.

However, I have ruled out hardware and power-related issues, so an Nvidia driver bug on the MZ32-AR1 (rev. 3.0) motherboard is the only explanation as far as I can tell, at least in combination with the 3090 cards. I know the video cards, cables, and PSU are good because they worked well in the gaming motherboard where I had them previously. On the new motherboard, reducing the number of connected GPUs or trying a different cable or PSU does not help. As mentioned, I also replaced the new MZ32-AR1 motherboard with another new one, which made no difference, and the motherboard itself is completely stable without the Nvidia cards, so I am sure it is good. The BIOS is the latest available version. Hence, an Nvidia driver bug is the only remaining explanation.

If anyone can suggest a possible workaround, or any additional information I should provide to pinpoint the issue further, I would be very grateful.

Did you try downgrading the Nvidia driver version?

I hear that GeForce RTX 4000-series or older cards work well on the older proprietary drivers that do not use GSP firmware.

Additionally, I have nearly the same problem on an RTX 5070 card.

In my case, the RTX 5070 reports “GPU has fallen off the bus” while the GPU is idle, and it happens at random.

I tried these steps, but none of them fixed it:

  • Turned off ASPM in Linux with pcie_aspm=off
  • Turned off PCIe port power management in Linux with pcie_port_pm=off
  • Set the power limit lower with nvidia-smi (see the sketch after this list)
  • Turned off ASPM in the UEFI BIOS
  • Reseated the RTX 5070 card
  • Downgraded the Linux kernel to older versions
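For the power-limit step, I used nvidia-smi roughly like this (a sketch; the wattage is only an example and depends on the card):

nvidia-smi -q -d POWER
sudo nvidia-smi -pl 150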

When this problem first happened, I thought it was a PSU issue. But I now think it is not triggered by the PSU.

I found some similar cases on this forum and elsewhere, and those reports say the problem also happens on laptops, or that replacing the PSU has no effect.

As for the PSU, in my case I am using a server-grade IBM 2880 W PSU to power the cards, and my workstation is protected by a 6 kW online UPS. I have also tried different cables, a different PSU, and even replacing the motherboard - so at this point I am sure the issue is caused by the Nvidia driver.

I have not tried downgrading yet - I am not sure which version to downgrade to. I think I have tested the 550 and 570 proprietary drivers, and now I am testing the 575 “open” Nvidia driver without any workarounds applied; so far it is still working well after 3 days and 3 hours, but only time will tell whether that remains the case. I will update this thread if I experience the issue again with the 575 driver.

I have not tried downgrading yet - I am not sure which version to downgrade to.

I hear that driver versions from before Blackwell (GeForce RTX 5000 series) support are stable.

Version 565.77 is probably the latest version from before Blackwell support.

But I cannot confirm myself that these drivers are stable - I do not have a GeForce RTX 3000 series card at the moment, sorry.

EDIT: I forgot that I do have a GeForce RTX 2060. That card is stable with the proprietary 575 driver, but that probably does not help your case.

Thank you for your suggestion. After further testing, nvidia-dkms-575-open also proved to be unstable, with the same issue. However, this time I noticed a few things: some hours before the GPUs fell off the bus, I started getting hardware acceleration errors when trying to run the mpv player; it could still play videos, though, probably falling back to software decoding. Also, this time nvidia-bug-report.sh finished without safe mode, so hopefully more debug info was collected.
nvidia-bug-report.log.gz (1.7 MB)

Based on your suggestion to downgrade, I downgraded to 535 to see if it helps (I know you recommended 565, but the issue was still present for me on 550 and 570, so I decided to go further down).

Another interesting find: RTX 3090: GPU has fallen off the bus (only Linux, on Windows everything is fine) - there the user reports that their RTX 3090 falls off the bus on Linux but not on Windows, and that Performance mode makes it much rarer than Adaptive mode in PowerMizer, which also matches my observation. That thread, however, mentions that 535 also has the bug, but I will see how stable it is in Performance mode. I suspect that Performance mode makes the issue less likely rather than solving it, but only time will tell. This does not change the fact that the bug is in the Nvidia driver and is still not fixed even in the latest 575 version.

When this bug happens, it causes all GPUs to fall off the bus. My guess is that something crashes or gets corrupted in kernel space because of the Nvidia driver, hence all PCI-E cards fall off the bus at once. As mentioned before, I tried replacing the motherboard (which was new to begin with), the power supply, etc. - none of it had any effect on the issue. The recent crash, where the Nvidia driver started producing weird errors beforehand, also points to the Nvidia driver, as do the reports of people having the issue only with the Linux Nvidia driver but not on Windows, with similar symptoms and the same video card as mine.

I hope an Nvidia employee can take a look at the debug log, because this issue seriously affects system stability, and I am using a server-grade motherboard, power supply, and online UPS.

Unfortunately, the 535 version did not help either. More than that, it crashed in much the same way as nvidia-dkms-575-open did. At the time I was running llama-imatrix for DeepSeek R1, using the GPUs for the context cache and the shared expert tensors; the estimated time to complete was about 20 hours, and it crashed after about 12 hours.

Now, with 535, it also crashed after about 12 hours of running the same llama-imatrix command. Here is the new bug report:
nvidia-bug-report.log.gz (239.5 KB)

And it is always all GPUs that fall off the bus. In one of my previous tests I even powered some of them from a separate, different power supply, and it made no difference to how it happens (if the issue were related to the power supply, there is no chance it would happen on all cards at the same time when two supplies are used). Using a single powerful 2880 W power supply does not affect the issue either.

Another thing - here, nvidia error "GPU has fallen off the bus" · Issue #3363 · pop-os/pop · GitHub, I found additional reports of people getting “GPU has fallen off the bus” with the Linux driver while exactly the same system works fine running Windows. Even though some people may experience a similar error due to a bad power supply or other hardware issues, that is not the case for me: not only have I tried replacing both the power supply and the motherboard, all my components are premium server-grade and powered by an online UPS. All of this points towards an Nvidia Linux driver bug.

I am using Adaptive PowerMizer mode again, because Performance mode - even though it may reduce the probability of triggering the bug - proved not to work for my current workload, and the crash happened at about the same point as without it.

Anyway, I decided to try upgrading to 575.51.02 after adding this PPA: sudo add-apt-repository ppa:graphics-drivers/ppa - this time using the normal (non-open) version. Additionally, based on the thread linked above, I added the following possible workarounds:

In /etc/default/grub I edited the CMDLINE like this and then ran sudo update-grub:

GRUB_CMDLINE_LINUX_DEFAULT="nvidia.NVreg_EnableGpuSleep=0 nvidia.NVreg_EnableGpuFirmware=0 nvidia_drm.modeset=1 nvidia_drm.fbdev=1 pcie_aspm=off"

In /etc/modprobe.d/nvidia.conf I have the following content (blacklist nouveau was already there, so I just added the second line):

blacklist nouveau
options nvidia NVreg_PreserveVideoMemoryAllocations=0
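As a quick sanity check after rebooting, the kernel command line and the effective nvidia module parameters can be confirmed with:

cat /proc/cmdline               # kernel parameters in effect
cat /proc/driver/nvidia/params  # nvidia module parameters in effect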

I only upgraded the Nvidia driver and added the workarounds above, then rebooted. Now exactly the same llama-imatrix command gives an estimated completion time of less than 12 hours instead of more than 20 hours as before. Very strange. Even though I appreciate the performance boost, my concern is that even if it now succeeds, it will be unclear whether the “GPU has fallen off the bus” issue is fixed, or whether it simply completed before the bug got triggered, or whether something else changed in the way it processes so that the bug is not triggered.

Some people reported that these workarounds may help with the issue, so I am hopeful, but we will see. It is really unpredictable - just when I thought I might have found a way to reproduce it, things changed once I updated to the latest 575 driver from the PPA. If I trigger the issue again even with all these workarounds, I will add an updated bug report for the newest driver.

Unfortunately, the newer driver and these workarounds did not help, and all GPUs fell off the bus once again.

I noticed that some hours before they did, CUDA started failing to initialize when starting applications - it would begin working again after a few tries, then start failing to initialize again. I provided a bug report and additional details while CUDA was failing to initialize here: With latest 575.51.02 driver, after working for some time, CUDA started to fail to initialize after a day of uptime

Then the GPUs fell off the bus soon after that. Here is the new bug report, with the latest 575.51.02 driver:
nvidia-bug-report.log.gz (2.8 MB)

I would like to reiterate that this issue is specific to the Linux Nvidia driver. In the messages above I already provided links to multiple independent reports (1, 2) of people having the GPU-falling-off-the-bus issue on Linux while the same hardware works fine on Windows.

Can someone from Nvidia investigate this, please? And let me know if I can provide more debug information.

I tried disabling C-state control in the BIOS to see if it makes a difference, but it did not. I also enabled detailed debug logging along with a few possible workarounds in /etc/modprobe.d/nvidia.conf (here the GPU firmware is enabled; I tried disabling it before and it did not help):

options nvidia NVreg_PreserveVideoMemoryAllocations=0
options nvidia NVreg_DynamicPowerManagement=0
options nvidia NVreg_EnableGpuFirmwareLogs=2
options nvidia NVreg_EnableResizableBar=1
options nvidia NVreg_EnableStreamMemOPs=1
options nvidia NVreg_InitializeSystemMemoryAllocations=1
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_EnableGpuFirmware=1
options nvidia NVreg_ResmanDebugLevel=2
options nvidia NVreg_EnableGpuSleep=0

After about a day, all GPUs fell off the bus once again. I collected a detailed bug report, but it is more than 100 MB, so I cannot attach it to this comment; instead I uploaded it to my host:

https://dragon.studio/2025/06/nvidia-bug-report.log.gz

Here is the part of the log from when the first GPU was reported as fallen off the bus:

2025-06-03T20:54:17.594605+00:00 camn kernel: NVRM: ioctl(0x2a, 0x1a781c40, 0x20)
2025-06-03T20:54:17.594606+00:00 camn kernel: NVRM: nvidia_close on GPU with minor number 0
2025-06-03T20:54:17.594607+00:00 camn kernel: NVRM: nvidia_close on GPU with minor number 1
2025-06-03T20:54:17.594607+00:00 camn kernel: NVRM: nvidia_close on GPU with minor number 2
2025-06-03T20:54:17.594608+00:00 camn kernel: NVRM: nvidia_close on GPU with minor number 3
2025-06-03T20:54:17.594609+00:00 camn kernel: NVRM: ioctl(0x29, 0x1a7826c0, 0x10)
2025-06-03T20:54:17.594617+00:00 camn kernel: NVRM: nvidia_close on GPU with minor number 255
2025-06-03T20:54:17.594619+00:00 camn kernel: NVRM: nvidia_ctl_close
2025-06-03T20:54:17.651580+00:00 camn kernel: usb 6-2: USB disconnect, device number 3
2025-06-03T20:54:17.665574+00:00 camn kernel: NVRM: ioctl(0x2a, 0x4ffed820, 0x20)
2025-06-03T20:54:17.665576+00:00 camn kernel: NVRM: Xid (PCI:0000:41:00): 79, pid=12680, name=Xorg, GPU has fallen off the bus.
2025-06-03T20:54:17.665576+00:00 camn kernel: NVRM: GPU 0000:41:00.0: GPU has fallen off the bus.
2025-06-03T20:54:17.677302+00:00 camn kernel: NVRM: GPU1 GSP RPC buffer contains function 78 (DUMP_PROTOBUF_COMPONENT) and data 0x0000000000000000 0x0000000000000000.
2025-06-03T20:54:17.677304+00:00 camn kernel: NVRM: GPU1 RPC history (CPU -> GSP):
2025-06-03T20:54:17.677305+00:00 camn kernel: NVRM:     entry function                   data0              data1              ts_start           ts_end             duration actively_polling
2025-06-03T20:54:17.677305+00:00 camn kernel: NVRM:      0    76   GSP_RM_CONTROL        0x0000000000730108 0x0000000000000010 0x000636b11578d846 0x0000000000000000          y
2025-06-03T20:54:17.677306+00:00 camn kernel: NVRM:     -1    76   GSP_RM_CONTROL        0x0000000000730245 0x0000000000000810 0x000636b11578d75c 0x000636b11578d7d3    119us
2025-06-03T20:54:17.677306+00:00 camn kernel: NVRM:     -2    76   GSP_RM_CONTROL        0x0000000000730246 0x000000000000080c 0x000636b11578d6c9 0x000636b11578d751    136us
2025-06-03T20:54:17.677307+00:00 camn kernel: NVRM:     -3    76   GSP_RM_CONTROL        0x0000000000730246 0x000000000000080c 0x000636b11578d61c 0x000636b11578d6be    162us
2025-06-03T20:54:17.677308+00:00 camn kernel: NVRM:     -4    76   GSP_RM_CONTROL        0x0000000000730291 0x0000000000000010 0x000636b11578d533 0x000636b11578d60f    220us
2025-06-03T20:54:17.677309+00:00 camn kernel: NVRM:     -5    76   GSP_RM_CONTROL        0x0000000000730108 0x0000000000000010 0x000636b11578d450 0x000636b11578d524    212us
2025-06-03T20:54:17.677309+00:00 camn kernel: NVRM:     -6    76   GSP_RM_CONTROL        0x0000000000730108 0x0000000000000010 0x000636b11578d28b 0x000636b11578d413    392us
2025-06-03T20:54:17.677312+00:00 camn kernel: NVRM:     -7    10   FREE                  0x00000000c1d22087 0x0000000000000000 0x000636b115787701 0x000636b115787826    293us
2025-06-03T20:54:17.677313+00:00 camn kernel: NVRM: GPU1 RPC event history (CPU <- GSP):
2025-06-03T20:54:17.677313+00:00 camn kernel: NVRM:     entry function                   data0              data1              ts_start           ts_end             duration during_incomplete_rpc
2025-06-03T20:54:17.677314+00:00 camn kernel: NVRM:      0    4128 GSP_POST_NOCAT_RECORD 0x0000000000000005 0x0000000001addb7c 0x000636aea6e18fc0 0x000636aea6e18fc3      3us
2025-06-03T20:54:17.677315+00:00 camn kernel: NVRM:     -1    4128 GSP_POST_NOCAT_RECORD 0x0000000000000005 0x0000000001addb7c 0x000636ae9cf9b0d2 0x000636ae9cf9b0d5      3us
2025-06-03T20:54:17.677316+00:00 camn kernel: NVRM:     -2    4128 GSP_POST_NOCAT_RECORD 0x0000000000000005 0x0000000001addb7c 0x000636ae8e1f1408 0x000636ae8e1f140b      3us
2025-06-03T20:54:17.677317+00:00 camn kernel: NVRM:     -3    4128 GSP_POST_NOCAT_RECORD 0x0000000000000005 0x000002830181c416 0x000636ae87cb6f83 0x000636ae87cb6f86      3us
2025-06-03T20:54:17.677317+00:00 camn kernel: NVRM:     -4    4128 GSP_POST_NOCAT_RECORD 0x0000000000000005 0x000002830181c416 0x000636ae87c9aa77 0x000636ae87c9aa79      2us
2025-06-03T20:54:17.677318+00:00 camn kernel: NVRM:     -5    4128 GSP_POST_NOCAT_RECORD 0x0000000000000005 0x000002830181c416 0x000636ae87c94f73 0x000636ae87c94f76      3us
2025-06-03T20:54:17.677318+00:00 camn kernel: NVRM:     -6    4128 GSP_POST_NOCAT_RECORD 0x0000000000000002 0x0000000000000025 0x000636ae87bd63a5 0x000636ae87bd63aa      5us
2025-06-03T20:54:17.677319+00:00 camn kernel: NVRM:     -7    4099 POST_EVENT            0x0000000000000001 0x0000000000000000 0x000636ae87bd635f 0x000636ae87bd6376     23us
2025-06-03T20:54:17.677319+00:00 camn kernel: CPU: 3 UID: 0 PID: 12680 Comm: Xorg Tainted: P           OE      6.14.0-15-generic #15-Ubuntu
2025-06-03T20:54:17.677320+00:00 camn kernel: Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
2025-06-03T20:54:17.677320+00:00 camn kernel: Hardware name: Giga Computing MZ32-AR1-00/MZ32-AR1-00, BIOS M18 07/28/2023

Here is a portion of the log containing the 1000 lines following that event (plus some additional lines before it, in case they provide useful information): GPUs fell off the bus - Pastebin.com

Can somebody please help me? I built this workstation for work, but currently it is unusable. I have tried replacing the motherboard and PSU, I am using an online UPS, and given that all GPUs fall off the bus at once, it strongly points towards a driver issue. Maybe the Nvidia driver is unstable on a server motherboard, I do not know. What I do know is that I have already tried everything I can to exclude the possibility of hardware issues, so I am not sure what else to try.

It would be of great help if someone from Nvidia would look into it and at least let me know whether this is a driver issue - or, if the debug logs I provided are not enough to determine that, what else I need to provide to get support.

I have been having the same problem with my 3090 for a couple of driver versions now (I think ever since 565).
My system only has one GPU though, running with a Cooler Master 650 W PSU.
For me, running LLMs rarely triggers the supposed bug, but what consistently triggers it are three games:
Warframe, Metaphor: ReFantazio and Death Stranding.
Occasionally it may work fine for a day and then resume crashing for weeks.
I have since swapped motherboards and power supplies and the problem persists.
I have tried some workarounds I found online, but so far nothing has worked.
Warframe with maxed settings will consistently cause the card to fail within seconds to a couple of minutes of launching the game; for the others it seems more random. (Though I’ve noticed that extra displays and/or streaming the game through Sunshine/Steam Remote Play seem to make it happen more frequently.)

Lastly, sometimes it logs different messages to the journal, like “Pageflip timed out! This is a bug in the nvidia-drm kernel driver” or “Failed to allocate NVKMS memory for GEM object”, which may or may not be followed by “GPU has fallen off the bus”. I will collect a few logs and may post them here in the future.
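To collect them, I plan to pull the kernel messages from the previous boot roughly like this (a sketch; the grep filter is just an example):

journalctl -k -b -1 --no-pager | grep -iE 'NVRM|nvidia|fallen off|pageflip|nvkms'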

Forgot to post specs in case it’s relevant.
Distro: CachyOS (Used to use Arch before, then Endeavour and had the problem on both)
Kernel: Linux 6.15.1-5-cachyos
CPU: AMD Ryzen 5 5600G
MoBo: Gigabyte B450M GAMING
GPU: NVIDIA GeForce RTX 3090 (Gainward)
Memory: 2x32GB DDR4 3200MT/s

So I’ve found out that my boost clocks were out of control, reaching as high as 1975 MHz (as opposed to the usual 1695 MHz boost for these cards¹).

I’ve done some testing, and limiting clocks to 1725 MHz¹ seems to have solved the issue.

The preliminary tests were done with Warframe, maxed settings, 1080p. The rig was rebooted between runs.

1st run: Uncapped clocks - fell off the bus in under 5 minutes.
2nd run: Capped to 1800MHz - Ran for about two hours until I got bored. Stopped with no issues.
3rd run: Uncapped clock - Didn’t even get past the login screen.
4th run: Capped now to 1725MHz - Played for 50 minutes, no issues.

I’ve also left it running overnight training a U-Net with nnUNet, and it lasted 10 hours² with clocks set to 1800 MHz. I am running another batch now with clocks set to 1725 MHz and will report back with more findings, along with tests in different scenarios.

¹ Stock FE cards should boost to 1695 MHz AFAIK, while Gainward lists their 3090 as boosting to 1725 MHz from the factory.
² Former record was 3 hours.

Thank you for sharing your findings. By the way, how exactly did you manage to cap the GPU clock frequency? I see mine can boost up to 1980 MHz on all four cards - I do not have any overclocking, though.

I am not sure it will help in my case, though, since for me not only do all four cards fall off the bus, but I have also noticed that a PCI-E USB adapter with some USB disks connected fails as well. Maybe the Nvidia driver causes the whole PCI-E bus to fail somehow. I am not sure a GPU clock issue could cause that - if it did, it should be limited to a single GPU, unless there is a driver bug that causes a system-wide failure.

Since replacing the motherboard with the same model did not help, and I had no other ideas in the meantime, I ended up ordering a completely different motherboard (Gooxi G2SERO-B) in the hope that it may help, but I have not received it yet. I wish I could have tried capping the frequencies before that, just to see whether it helps, before deciding to order another one.

I have also never found a quick way to reproduce the issue. Right now I have more than 3 days of uptime, but before that it crashed in about a day, and in some cases in under an hour. One time I had more than two weeks of uptime.

Looking at the TechPowerUp 3090 page, different manufacturers set their maximum boost clocks all over the place, which may be OK for intermittent full load but is probably not ideal for running heavy tasks.

nvidia-smi can be used to limit clocks on cards where that option is supported.

I found the right command:

sudo nvidia-smi --lock-gpu-clocks=210,1725

I set 210 as the minimum because that is what nvidia-settings reports as the current lower clock. After running this single command, the limits were adjusted on all four GPUs, and now they do not boost beyond 1725 MHz. That said, I am not sure whether it will help in my case. Even if it does, there is most likely no need to underclock that much - perhaps 1925 MHz or so would be enough - but since I do not have a quick way to reproduce the issue, I decided to start at the lowest clock first.
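For reference, a sketch of how the applied limits can be checked and later reverted (both commands apply to all GPUs unless one is selected with -i):

nvidia-smi -q -d CLOCK              # shows current and maximum clocks per GPU
sudo nvidia-smi --reset-gpu-clocks  # restores default clock behaviour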

If you still have issues, it may be worth going a step lower, as Nvidia’s reference design uses 1695 MHz as the maximum.

You did say it usually happens when the system is idle, though, so it is probably not the prime cause.

This was a gigantic fluke. That’s why you avoid small sample sizes, folks!
Not only is the problem back regardless of memory/core clock settings, it is now happening in games where the issue never happened before.
Specifically Elden Ring Nightreign, in which I’ve logged over 50 hours of playtime since launch (May 29, 2025). It now won’t get past the first loading screen.
The only thing that changed since then is that I updated to the newest drivers, though I’m hesitant to blame them.

I tried a few more things since, and the only oddity I noticed was that LACT reports HW throttling even though temps are good (if you consider 80°C on the package and 82°C on the VRAM good).

I have also reseated the GPU and the power cables in my PC again, just in case, and nada.

I’m sincerely considering grabbing a cheap SATA SSD and installing Windows (shudders) just to see.