SPI read failures on multi-GPU machine

I am a Ph.D. student building a server for my laboratory with four 3090 Tis, but I am running into some difficult-to-diagnose bugs. The machine runs Ubuntu Server 20.04. With only one or two GPUs installed I rarely see these issues, but with three or four GPUs I get errors like “Xid 122: SPI read failure at address XXXX” as well as occasional “Xid 79: GPU has fallen off the bus” errors during boot. Sometimes only 3 of the GPUs show up in nvidia-smi after these errors, even though the missing device still appears in the output of lspci. Most concerningly, even when all 4 GPUs are visible, nvidia-smi can take up to a minute to return GPU information, and loading data onto the GPUs can similarly take longer than expected. The problem gets worse when the GPUs have a job running on them, and it sometimes causes the watchdog daemon to report CPU soft lockups.
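In case it is useful, the checks I have been running look roughly like the following (the exact queries are just examples):

    # kernel log entries for Xid errors and bus drop-offs
    sudo dmesg | grep -iE 'xid|nvrm'

    # compare what the PCIe bus reports against what the driver sees
    lspci | grep -i nvidia
    nvidia-smi -L

    # time how long nvidia-smi takes to respond
    time nvidia-smi --query-gpu=index,name,temperature.gpu --format=csv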

I have done significant testing on this problem, though when the serious issues are present, the whole system locks up when I run nvidia-bug-report.sh. I have tested each GPU alone, as well as each PCIe slot, and I have gone through every BIOS/UEFI setting that seemed like it could pertain to the issue (4G decoding, resizable BAR support, CSM support, etc.), but none of these helped. I have also tried many kernel parameters, including those for RCU idling and pcie_aspm, and many driver versions (all available combinations of the headless, server, and no-dkms options for 470, 505, 510, 520, and 525), but none of this seems to fix the underlying issue.
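For the kernel parameters, I have been editing the GRUB command line along these lines (the values shown are only examples of what I tried, not a known fix):

    # /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_aspm=off pcie_port_pm=off"

    # apply and reboot
    sudo update-grub
    sudo reboot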

Please let me know if there is anything I can do to further test/diagnose this problem or if there might be potential solutions! Thanks!

Xid 79 might occur on power peaks due to an insufficient PSU, or due to overheating because the 3090s block each other’s airflow.
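To check the power angle, you could cap the board power limit and watch draw/temperature while a job runs, e.g. (the 280 W cap is just an arbitrary value below the stock limit and must lie within the min/max limits that nvidia-smi -q reports):

    # enable persistence mode and lower the per-board power limit
    sudo nvidia-smi -pm 1
    sudo nvidia-smi -pl 280

    # log power draw and temperature every 5 seconds under load
    nvidia-smi --query-gpu=index,power.draw,temperature.gpu --format=csv -l 5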
Depending on the CPU used, this might also be a PCIe power-management issue; try upgrading the kernel using the Liquorix PPA and check for a BIOS update.
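If you want to test the Liquorix kernel, the install is roughly the following (double-check the PPA and package names for your release, and remember the NVIDIA DKMS module has to rebuild against the new kernel headers):

    sudo add-apt-repository ppa:damentz/liquorix -y
    sudo apt update
    sudo apt install linux-image-liquorix-amd64 linux-headers-liquorix-amd64
    sudo reboot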