Keep getting "GPU has fallen off the bus" with 3090 cards on Gigabyte MZ32-AR1 Rev 3.0 motherboard

Unfortunately, the 535 version did not help either. Moreover, it crashed in much the same way as nvidia-dkms-575-open: I was running llama-imatrix for DeepSeek R1, using the GPUs for the context cache and common expert tensors. The estimated time to complete was about 20 hours, and it crashed after about 12 hours.

Now, with 535, it also crashed after about 12 hours of running the llama-imatrix command. Here is the new bug report:
nvidia-bug-report.log.gz (239.5 KB)

It is always all GPUs that fall off the bus at once. In one of the previous tests I even powered some of them from a separate power supply, and it made no difference to how it happens (if the issue were related to a power supply, it would be very unlikely to hit all cards at the same moment when two supplies are in use). Using a single powerful 2880W power supply does not affect the issue either.
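For anyone who wants to confirm the "all cards at once" pattern from the kernel log, here is a minimal sketch. It assumes the driver logs the event as Xid 79 in the usual `NVRM: Xid (PCI:<address>): 79, ...` form; the helper name xid79_gpus is made up for illustration.

```shell
# xid79_gpus: read kernel log text on stdin and print, once each,
# every PCI address that reported Xid 79 ("GPU has fallen off the bus").
# Assumes log lines of the form: NVRM: Xid (PCI:0000:41:00): 79, ...
xid79_gpus() {
    sed -n 's/.*NVRM: Xid (PCI:\([0-9A-Fa-f:.]*\)): 79,.*/\1/p' | sort -u
}
```

After a crash and reboot, something like `journalctl -k -b -1 | xid79_gpus` (previous boot, needs a persistent journal) or `dmesg | xid79_gpus` (current boot) should list one address per affected card. Comparing the timestamps of the matching lines shows whether they dropped together.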

Another thing: in the GitHub issue nvidia error "GPU has fallen off the bus" · Issue #3363 · pop-os/pop, I found additional reports from people getting “GPU fell off the bus” with the Linux driver while exactly the same system works fine under Windows. Some people may see a similar error due to a bad power supply or other hardware issues, but this is not the case for me: not only did I try replacing both the power supply and the motherboard, all my components are premium server grade and powered by an online UPS. All this points towards an Nvidia Linux driver bug.

I am using Adaptive PowerMizer mode again because Performance mode, even though it may reduce the probability of triggering the bug, did not help with my current workload: the crash happened at about the same time as without it.

Anyway, I decided to try upgrading to 575.51.02 after adding this PPA: sudo add-apt-repository ppa:graphics-drivers/ppa - this time using the normal (non-open) version. Additionally, from the thread linked above, I added the following possible workarounds:

In /etc/default/grub I edited GRUB_CMDLINE_LINUX_DEFAULT like this and then ran sudo update-grub:

GRUB_CMDLINE_LINUX_DEFAULT="nvidia.NVreg_EnableGpuSleep=0 nvidia.NVreg_EnableGpuFirmware=0 nvidia_drm.modeset=1 nvidia_drm.fbdev=1 pcie_aspm=off"

In /etc/modprobe.d/nvidia.conf I have the following content (blacklist nouveau was already there, so I just added the second line):

blacklist nouveau
options nvidia NVreg_PreserveVideoMemoryAllocations=0
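Since a typo in either file fails silently, it may be worth checking after the reboot that the kernel actually booted with these parameters. This is a minimal sketch, not a definitive check: the function names has_param and check_cmdline are made up for illustration, and the parameter list is simply copied from the GRUB line above.

```shell
# has_param CMDLINE PARAM: succeed if PARAM is a whole space-separated
# token of CMDLINE (so "xpcie_aspm=off" does not count as "pcie_aspm=off").
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
    esac
    return 1
}

# check_cmdline FILE: print "ok" or "MISSING" for each expected parameter,
# reading the kernel command line from FILE (normally /proc/cmdline).
check_cmdline() {
    cmdline=$(cat "$1")
    for p in nvidia.NVreg_EnableGpuSleep=0 nvidia.NVreg_EnableGpuFirmware=0 \
             nvidia_drm.modeset=1 nvidia_drm.fbdev=1 pcie_aspm=off; do
        if has_param "$cmdline" "$p"; then
            echo "ok      $p"
        else
            echo "MISSING $p"
        fi
    done
}
```

Run it as `check_cmdline /proc/cmdline`. This only shows that the parameters were passed at boot, not that the driver accepted them. For the values the loaded module is actually using, `grep -E 'EnableGpuFirmware|PreserveVideoMemoryAllocations' /proc/driver/nvidia/params` should help, assuming those keys are listed there on this driver version.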

I only upgraded the Nvidia driver and added the workarounds above, then rebooted. Now exactly the same llama-imatrix command estimates completion in less than 12 hours instead of more than 20 hours as before. Very strange. Even though I appreciate the performance boost, my concern is that even if it now succeeds, it will be unclear whether the “GPU fell off the bus” issue is fixed, whether the run simply completed before the bug got triggered, or whether something else changed in how it is processed so that the bug no longer triggers.

Some people reported that these workarounds may help with the issue, so I am hopeful, but we will see. It is really not predictable: just when I thought I might have found a way to reproduce it, things changed once I updated to the latest 575 driver from the PPA. If I trigger the issue again even with all these workarounds, I will add an updated bug report for the newest driver.