Dual 3090 crashes when I use both the GPUs

Hi all!

I recently purchased the following rig:

2 x NVIDIA RTX 3090
AMD Ryzen 9 3900XT
ASRock X570 Motherboard
Antec Signature 1300W Power Supply

Installed Nvidia driver 455.38, CUDA 11.1, PyTorch 1.7.1, on Ubuntu 20.04 and tried running deep learning benchmarks.

The problem is that everything runs fine if I use either the first GPU or the second GPU on its own. But the moment I run both of them, the PC just shuts off. I suspected a power-related issue, but I installed Windows and ran all the GPU benchmarks there, and everything ran totally fine. Now I suspect a driver-related issue. Has anyone else faced something similar? Or is there something I can do to narrow down the issue? Thanks!

I am assuming you ran the same GPU tests on both OSes.

I see there’s a 455.45 Linux driver up now, which may be worth a shot.

I have exactly the same problem on Ubuntu 20, but I am working with the PyTorch nightly version (1.8), and my driver version is 460.

  1. I tried a stress test that loaded both GPUs to 100% utilization, and it worked fine without crashing.
  2. I tried limiting the power of the GPUs to 200W (using the ‘sudo nvidia-smi -pl 200’ command) and started the PyTorch training script, and it crashed again.

So I guess it isn’t a power supply issue (it’s a SilverStone 1500 watt power supply).

I don’t know what else can be done.

What are the exact symptoms of the “crash”?

If it is a spontaneous reboot, or a bunch of “GPU fell off the bus” error messages in the system log, it is likely power related. Generally speaking, a 1500W power supply should be sufficient for dual RTX 3090s.
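
A minimal sketch of how one might search a saved kernel log for such messages. The function name `find_bus_errors` is hypothetical, and the pattern is an assumption: the exact wording ("fell off the bus" versus "has fallen off the bus") can differ between driver versions, so adjust it to what your log actually contains.

```python
import re
import sys

# Assumption: the driver logs a lost GPU with wording close to
# "GPU fell off the bus" or "GPU has fallen off the bus".
BUS_ERROR = re.compile(r"(?:fallen|fell) off the bus", re.IGNORECASE)


def find_bus_errors(lines):
    """Return the log lines that mention a GPU dropping off the bus."""
    return [line.rstrip("\n") for line in lines if BUS_ERROR.search(line)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 find_bus_errors.py saved_kernel_log.txt
    with open(sys.argv[1], errors="replace") as log_file:
        for hit in find_bus_errors(log_file):
            print(hit)
```

The log file could be produced with something like `journalctl -k -b -1 > saved_kernel_log.txt` after the reboot, assuming the journal is persistent on your system. An empty result does not rule out a power problem, since a hard reboot may happen before anything is written to disk.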

Note that standard GPU stress tests may not stress sufficiently, as system instability is usually down to sudden power spikes of both CPU(s) and GPU(s) coinciding which can overwhelm power supplies, especially cheap ones. This happens in some games, and more commonly in GPU compute applications. What’s the complete system hardware configuration? Does the SilverStone PSU have an 80 PLUS rating?

“Crash” means that the server just rebooted. I work remotely over VNC, so all I saw was the connection being aborted, and after a few minutes the server had booted into Linux again.

We used this guide to stress the CPU:

and it didn’t restart. You can see the state of the GPUs (“nvidia-smi” output) during this stress test in the attached picture (screenshot-1).

It seems the PSU has the 80 PLUS rating, according to their site:

I can’t tell if it is a PSU issue, because the stress test went well. I also ran my PyTorch script after limiting the power of the GPUs to 200W (using the ‘sudo nvidia-smi -pl 200’ command), and it still crashed.

As I said, sudden reboots under high and rapidly changing load strongly point toward a power supply issue. If you can run just fine when using a single GPU (e.g. by blocking one via CUDA_VISIBLE_DEVICES), that would be further confirmation of this working hypothesis. My understanding is that the GPU power limits are enforced on the scale of seconds, which is not sufficient to prevent short-duration spikes (see more below). Note that the issue could be due to CPU power draw as well.
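
A minimal sketch of the single-GPU check, assuming the workload can be started from the command line. The helper names `single_gpu_env` and `run_on_each_gpu` are hypothetical, as is the `train.py` script in the usage note.

```python
import os
import subprocess
import sys


def single_gpu_env(gpu_index, base_env=None):
    """Copy the environment and expose only one GPU to CUDA applications."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_each_gpu(command, gpu_indices=(0, 1)):
    """Run the same command once per GPU, one after the other.

    Returns a dict mapping GPU index to the exit code of that run.
    """
    results = {}
    for idx in gpu_indices:
        completed = subprocess.run(command, env=single_gpu_env(idx))
        results[idx] = completed.returncode
    return results
```

For example, `run_on_each_gpu([sys.executable, "train.py"])` would run the (hypothetical) training script first on GPU 0 only, then on GPU 1 only. If both runs complete but the dual-GPU run reboots the machine, that is consistent with the power hypothesis, though it does not prove it.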

I asked for the total hardware configuration (CPUs, system memory, mass storage) to get an idea of the total nominal power consumption. By my guess of a typical system, a 1500W PSU should be sufficient for a system with dual RTX 3090s, but that may not actually be the case depending on specifics. It is also possible that there are imbalances in the power supply, e.g. incorrect distribution of loads across PSU rails. You might want to have the system examined by a local expert to find potential power distribution issues; this is not really doable remotely. My rule of thumb is that rock solid operation requires that total nominal power consumption should not significantly exceed 60% of the nominal power output of the PSU (power supply unit).
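
To make the 60% rule of thumb concrete, here is a rough sketch of the arithmetic. The function names are hypothetical. The example numbers are illustrative only: 350W per RTX 3090 and 105W for a Ryzen 9 3900XT are the vendors' nominal ratings for the parts listed in the first post, and the 100W for everything else is a placeholder guess, not a measurement.

```python
def min_psu_watts(nominal_loads_w, max_load_fraction=0.6):
    """PSU rating at which the summed nominal load equals max_load_fraction."""
    return sum(nominal_loads_w.values()) / max_load_fraction


def load_fraction(nominal_loads_w, psu_watts):
    """Fraction of the PSU's nominal output used by the summed nominal load."""
    return sum(nominal_loads_w.values()) / psu_watts


# Illustrative figures only; substitute the ratings of your own components.
example_loads_w = {
    "gpu_0": 350,           # nominal board power, RTX 3090
    "gpu_1": 350,           # nominal board power, RTX 3090
    "cpu": 105,             # nominal TDP, Ryzen 9 3900XT
    "rest_of_system": 100,  # placeholder guess: board, memory, storage, fans
}

if __name__ == "__main__":
    print(f"Suggested minimum PSU: {min_psu_watts(example_loads_w):.0f} W")
    print(f"Load on a 1500 W PSU: {load_fraction(example_loads_w, 1500):.0%}")
```

With these placeholder figures the nominal load lands right around 60% of a 1500W unit, so the outcome depends heavily on what the rest of the system really draws.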

The reason for this large margin is that modern CPUs and GPUs use dynamic clocking and dynamic voltage levels, and as computational loads change, this can cause rapid and massive changes in instantaneous power draw. The corollary is that there will be short-duration (millisecond range) power spikes that far exceed the TDP (or essentially similar) power ratings associated with these processors. Those ratings are based on power draw averaged over longer time frames on the order of minutes, and are primarily important for the design of cooling solutions (TDP = thermal design power).

More cheaply constructed PSUs in particular have a tendency to be overwhelmed by massive power spikes in high-performance systems. This can happen in two ways. In the first, the sudden increase in power draw causes a temporary voltage drop (“brown out”) on the motherboard. This leads to unstable operation of electronic components (higher switching speeds require higher voltage), so the motherboard negates the “power good” signal, which causes a reboot. In the second, the PSU itself detects a potentially destructive overload and shuts down.

I do not consider a PSU with 80 PLUS Silver rating (like the one you linked to) suitable for modern HPC/AI systems. My strong minimum recommendation for an HPC/AI workstation is 80 PLUS Gold, preferably 80 PLUS Platinum. My recommendation for an HPC/AI server is 80 PLUS Platinum, preferably 80 PLUS Titanium. These higher-certified PSUs offer higher efficiency, which means more of the power draw at the wall outlet (which can be quite limited in the US due to our 110V AC system) is available to the system. These PSUs typically also offer higher reserves, better component quality and build quality, and longer manufacturer warranties.

One more thought: Make sure there are two separate cables running from the PSU to the dual-8-pin-to-12-pin adapter for the RTX 3090, and that daisy-chaining is not used.

Each 8-pin PCIe auxiliary power cable is designed to supply 150W, and the PCIe slot itself is designed to supply up to 75W.
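
As a rough sketch of why the cabling matters, the design budget per card can be added up from those two figures. The function name is hypothetical, and treating a daisy-chained cable as a single 150W cable is a simplifying assumption for illustration.

```python
PCIE_8PIN_CABLE_W = 150  # design figure per 8-pin auxiliary power cable
PCIE_SLOT_W = 75         # design figure for the PCIe slot itself


def card_power_budget_w(num_separate_cables):
    """Design power budget for one card fed by separate 8-pin cables plus the slot."""
    return num_separate_cables * PCIE_8PIN_CABLE_W + PCIE_SLOT_W


if __name__ == "__main__":
    # Two separate cables per card versus one daisy-chained cable
    # (assumed here to count as a single 150 W cable).
    print("two separate cables:", card_power_budget_w(2), "W")
    print("one daisy-chained cable:", card_power_budget_w(1), "W")
```

Under that assumption, two separate cables give a budget above the 350W nominal board power of an RTX 3090, while a single daisy-chained cable falls well short of it.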

Hi guys, thanks for your answers. FYI, I fixed this problem by switching to a 2000W PSU. Someone on Reddit recommended this, saying the peak draw of a single 3090 can go up to 1000W, so I just replaced the old PSU with a 2000W one and it worked.

I will buy this PSU:
Corsair AX1600i 1600W Full Modular PSU 80+ Titanium

Do you think it’s sufficient?

Sorry, I can provide some generic guidelines, but I cannot comment on or recommend specific hardware.