Two out of three GPUs lost, even in lspci, shutdown does not help

Hi, I have three 3090 GPUs in my server. I was running StyleGAN2 (in PyTorch), and after some time two of the GPUs died.
They are missing not only from nvidia-smi but also from lspci. I'm using the latest version of the NVIDIA driver on CentOS 8.3.2011.
I have also attached my nvidia-bug-report: nvidia-bug-report.log (1.6 MB)
Thank you in advance.

Can you clarify “shutdown does not help”? Does this mean you power cycled the system repeatedly, but two of the three RTX 3090s still do not show up in lspci after reboot? If that is not what you mean by shutdown, I would suggest trying a power cycle and cold boot first.

I have had occasional trouble with GPUs not being recognized in a workstation, for reasons that are not clear to me. The “Voodoo chicken” method I used was to power down the system, unplug the auxiliary GPU power connector[s], remove the GPU from the PCIe slot, plug it back in ensuring it is properly inserted, make sure it is mechanically secured at the bracket, reconnect the power cable[s], and start up the system.

A GPU can “fall off the bus” (search for error messages of that sort in the system logs) when running deep learning applications. By far the most common reason for this is an inadequately sized power supply. However, this should not damage the GPU, and it should re-appear after power cycling the system.
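As a rough sketch of the log search suggested above, the following Python helper filters log lines for that kind of message. It assumes the driver's wording contains a phrase like "fallen off the bus"; the function name `find_bus_drops` is made up for this illustration, and the exact message text may differ between driver versions.

```python
import re

# Assumption: the kernel/driver message contains a phrase such as
# "fallen off the bus"; the pattern also accepts "fall off" and "fell off".
PATTERN = re.compile(r"f(?:a|e)ll(?:en)?\s+off\s+the\s+bus", re.IGNORECASE)


def find_bus_drops(lines):
    """Return the log lines that mention a GPU falling off the bus."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]
```

You could feed it the lines of a saved kernel log, for example the output of `dmesg` or `journalctl -k` written to a file and opened with `open(path)`.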

An adequate power supply for rock-solid operation means the sum of the nominal power draw of all system components does not significantly exceed 60% of the nominal wattage of the PSU (power supply unit). In your case that probably means a PSU of something like 2000W, as three 3090s together already account for 1050W of TDP (which does not include short-lived power spikes).
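To make the 60% rule of thumb concrete, here is a small Python sketch of the arithmetic. The 350W per GPU follows from the 1050W total quoted above; the 150W figure for the rest of the system is a purely hypothetical placeholder, and `recommended_psu_watts` is a name invented for this example.

```python
import math


def recommended_psu_watts(component_watts, max_load_percent=60):
    """Smallest PSU rating (in W) that keeps nominal draw at or below the given load."""
    total = sum(component_watts)
    return math.ceil(total * 100 / max_load_percent)


gpus = [350, 350, 350]   # three RTX 3090s, 1050 W TDP in total
rest_of_system = 150     # hypothetical: CPU, drives, fans, etc.

print(recommended_psu_watts(gpus))                     # GPUs alone: 1750
print(recommended_psu_watts(gpus + [rest_of_system]))  # whole system: 2000
```

With these assumed numbers the GPUs alone already call for about 1750W, and a modest allowance for the rest of the system brings the estimate to roughly 2000W. Transient spikes are not modeled here.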

I had a chat with Seasonic. They told me that in their labs they have seen RTX 3090 transient loads spike to north of 550W before the power limits kick in and pull them back down.


Thank you so much, a hard reset helped. I was using the reboot command before, but completely shutting down the system resolved the issue.
