I am a Ph.D. student building a server for my laboratory with four 3090 Tis, but I am running into some difficult-to-diagnose bugs. The machine runs Ubuntu Server 20.04. With only one or two GPUs installed I rarely see problems, but with three or four GPUs plugged in I get errors like “Xid 122: SPI read failure at address XXXX” as well as occasional “Xid 79: GPU has fallen off the bus” during boot. Some of the time, only 3 of the GPUs show up in nvidia-smi after these errors, even though the missing device still appears in the output of lspci.

Most concerningly, even when all 4 GPUs are visible, nvidia-smi can take up to a minute to return GPU information, and loading data onto the GPUs similarly takes longer than expected. The slowdown gets even worse when the GPUs have a job running on them, and it sometimes causes the kernel watchdog to report CPU soft lockups.
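In case it helps, this is roughly the kind of output I can collect and post here; the PCI bus address below is just a placeholder for whichever card goes missing:

```bash
# Recent Xid messages and PCIe bus errors from the kernel log
sudo dmesg -T | grep -iE "xid|nvrm|pcie bus error"

# Negotiated link speed/width for a given GPU
# (01:00.0 is a placeholder -- substitute the bus IDs reported by lspci)
sudo lspci -vvv -s 01:00.0 | grep -i "lnksta"

# Time a basic query and confirm all four boards enumerate
time nvidia-smi --query-gpu=index,name,pci.bus_id,pstate --format=csv
```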
I have done a fair amount of testing on this already, though when the worst symptoms are present, the whole system locks up as soon as I run nvidia-bug-report.sh. I have tested each GPU on its own, tried each PCIe slot, and toggled every BIOS/UEFI setting that seemed like it could be relevant (Above 4G decoding, Resizable BAR support, CSM support, etc.), but none of this helped. I have also tried a number of kernel parameters, such as those related to RCU idling and pcie_aspm, and many driver versions (all available combinations of the headless, server, and no-dkms variants of 470, 505, 510, 520, and 525), but nothing seems to fix the underlying issue.
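For reference, the kernel-parameter and driver changes were along these lines; the exact values varied between attempts, so treat the specifics below as examples rather than the full list of what I tried:

```bash
# One combination of kernel parameters, set in /etc/default/grub
# (other attempts added various rcu_* options here as well)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_aspm=off"
sudo update-grub && sudo reboot

# One of the driver variants tried, using Ubuntu 20.04 packages
sudo apt purge '^nvidia-.*'
sudo apt install nvidia-driver-525-server
```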
Please let me know if there is anything else I can do to test or diagnose this problem, or if there are any potential solutions I should try! Thanks!