I have an EPYC 7763 CPU in a Gigabyte MZ32-AR1-rev-30 motherboard, with 1 TB of 3200 MHz RAM. The motherboard is new and has no issues except for Nvidia cards periodically falling off the bus.
What I have tried so far:
Removing all PCI-E devices except the Nvidia cards - did not help. However, I noticed that the Nvidia driver can cause other PCI-E devices to malfunction when it falls off the bus, if they are present. Without the Nvidia cards there are no stability issues.
I use a server-grade IBM PSU rated at 2880W to power the four 3090 GPUs, but the issue usually happens when they are idle, so I am confident power is not the issue.
Switching to PCI-E 3.0 or even PCI-E 2.0 - did not help either, which suggests signal integrity is not the problem. Enabling PCI-E Advanced Error Reporting (AER) in the BIOS confirmed there are no errors.
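For reference, this is roughly how I check for AER or bus errors after a drop (a minimal sketch; the grep patterns are just the messages I look for, not an exhaustive list):

# Look for AER / Xid / "fallen off the bus" messages in the kernel log
sudo dmesg | grep -i -e "AER" -e "Xid" -e "fallen off the bus"
# Inspect the AER capability on the Nvidia devices (vendor ID 10de)
sudo lspci -d 10de: -vvv | grep -i -A4 "Advanced Error Reporting"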
Various kernel flags that did NOT help:
amd_iommu=on
kvm.ignore_msrs=1
iommu=pt pcie_aspm=off
rcutree.rcu_idle_gp_delay=1
pci=realloc=off
With these I kept getting “GPU has fallen off the bus” every 1-2 days, though sometimes in less than an hour after boot - the timing is quite random. A sketch of how the flags were applied is below.
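For completeness, this is roughly how I set them via GRUB (a sketch; not every flag was active in every test, and the exact combination varied between boots):

# /etc/default/grub (example combination, then regenerate the config and reboot)
GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on iommu=pt pcie_aspm=off kvm.ignore_msrs=1 rcutree.rcu_idle_gp_delay=1 pci=realloc=off"
# sudo update-grub && sudo reboot
# verify with: cat /proc/cmdline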
Setting nvidia.NVreg_EnableGpuFirmware=0 to disable the GSP firmware (without the other kernel options), and then running the system for a while in Performance mode:
# Set every GPU to "Prefer Maximum Performance" (PowerMizer mode 1)
GPU_COUNT=$(nvidia-smi --list-gpus | wc -l)
for i in $(seq 0 $((GPU_COUNT - 1))); do
    nvidia-settings -a "[gpu:${i}]/GpuPowerMizerMode=1"
done
…then switching back to Adaptive mode after some days, resulted in 16 days of uptime, at which point I powered down normally to upgrade my M.2 SSD. During that run I also applied the suspend/resume workaround (described in a Reddit thread) to bring power down in idle or partial-load states; that is another Nvidia driver bug that causes each card to consume 10-15W more than it should at idle unless the suspend/resume workaround is used.
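The workaround itself is just a suspend-to-RAM cycle. A sketch of how I trigger it, using rtcwake to auto-resume (the 15-second wake delay is arbitrary):

# Suspend to RAM and auto-wake via the RTC after ~15 seconds;
# after resume the GPUs settle back to their low idle power draw.
sudo rtcwake -m mem -s 15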
I thought that maybe nvidia.NVreg_EnableGpuFirmware=0 helped, since 16 days of uptime was far greater than the 1-3 day uptimes I was getting before when encountering the issue. But after I booted back up (this time using Adaptive mode from the start, with no suspend/resume trick, to check whether nvidia.NVreg_EnableGpuFirmware=0 alone made the difference), the GPUs fell off the bus again after less than 3 days. I am attaching the debug log. I had to run it with --safe-mode, otherwise it hangs forever.
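For reference, the attached log was generated like this once the GPUs had already dropped (without --safe-mode the script never finishes):

sudo nvidia-bug-report.sh --safe-mode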
I have been fighting this problem for over two months now, and so far it seems to be related to power management. I have switched back to Performance mode and will see if I encounter the issue again, but the problem is that in Performance mode my rig consumes almost 0.5 kW extra while idle on the GPUs alone, which is a lot. In Adaptive mode, the four GPUs usually consume less than 100W while idle.
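This is how I compare the two modes; the performance state and power draw per GPU are easy to query (sketch):

# Show current performance state and power draw for each GPU
nvidia-smi --query-gpu=index,name,pstate,power.draw --format=csv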
Also, the GPUs came from my previous rig, which was based on a gaming motherboard and never had this issue, connected using the same PCI-E 4.0 cables.
I think this is an Nvidia driver bug, since power modes and power-saving features clearly have an effect on it, while everything else, including switching to slower PCI-E 3.0 or 2.0 or removing other PCI-E devices, has no effect at all. That said, since it takes a long time to encounter the issue, it is hard to say what helped and what did not, but I hope this is enough to look into it. If anyone has any other ideas or suggestions on what else to try to narrow down the issue, I would greatly appreciate them.
nvidia-bug-report.log.gz (389.9 KB)