2 x NVIDIA RTX 3090
AMD Ryzen 9 3900XT
ASRock X570 Motherboard
Antec Signature 1300W Power Supply
Installed NVIDIA driver 455.38, CUDA 11.1, and PyTorch 1.7.1 on Ubuntu 20.04 and tried running deep learning benchmarks.
The problem is that everything runs fine if I use either the first GPU or the second GPU. But the moment I run both of them, the PC just shuts off. I suspected a power-related issue, but I installed Windows and ran all the GPU benchmarks, and everything runs totally fine there. Now I suspect some driver-related issue. Did anyone else face something similar? Or is there something I can do to narrow down the issue? Thanks!
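In case it helps to reproduce this outside the benchmarks, here is a minimal sketch of the kind of load that triggers it (it assumes two visible CUDA devices; the matrix size and iteration count are arbitrary):

    # Minimal dual-GPU load sketch (assumes two visible CUDA devices).
    # Matrix size and iteration count are arbitrary; adjust to taste.
    import threading
    import torch

    def burn(device, size=8192, iters=200):
        a = torch.randn(size, size, device=device)
        b = torch.randn(size, size, device=device)
        for _ in range(iters):
            a = a @ b              # keep the GPU busy with large matmuls
        torch.cuda.synchronize(device)

    # One GPU at a time: stable on my machine
    for dev in ("cuda:0", "cuda:1"):
        burn(dev)

    # Both GPUs at once: this is where the shutdown happens
    threads = [threading.Thread(target=burn, args=(f"cuda:{i}",)) for i in range(2)]
    for t in threads: t.start()
    for t in threads: t.join()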
If it is a spontaneous reboot, or a bunch of "GPU fell off the bus" error messages in the system log, it is likely power related. Generally speaking, a 1500W power supply should be sufficient for dual RTX 3090s.
Note that standard GPU stress tests may not stress the system sufficiently, as system instability is usually down to sudden power spikes of both CPU(s) and GPU(s) coinciding, which can overwhelm power supplies, especially cheap ones. This happens in some games, and more commonly in GPU compute applications. What's the complete system hardware configuration? Does the SilverStone PSU have an 80 PLUS rating?
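To check for those "fell off the bus" or Xid messages after a spontaneous reboot, something along these lines works (a sketch assuming journalctl with a persistent journal; run with suitable privileges):

    # Scan the kernel log from the previous boot for NVIDIA Xid errors
    # or "fallen off the bus" messages.
    import subprocess

    log = subprocess.run(
        ["journalctl", "-k", "-b", "-1"],   # kernel messages, previous boot
        capture_output=True, text=True,
    ).stdout

    for line in log.splitlines():
        if "Xid" in line or "fallen off the bus" in line.lower():
            print(line)

If nothing shows up and the machine simply power-cycled, that is consistent with the PSU cutting out rather than a driver fault.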
"crash" means that the server just rebooted. I work remotely using VNC, so I just got a "connection aborted" error, and after a few minutes the server booted into Linux again.
We used this guide to stress the CPU:
and it didn't restart. You can see in the attached picture the state of the GPUs (output of the "nvidia-smi" command) during this stress test.
I can't tell if it is a PSU issue, because the stress test went well. Also, I ran my PyTorch script after limiting the power of the GPUs to 200W (using the "sudo nvidia-smi -pl 200" command), and it still crashed.
As I said, sudden reboots under high and rapidly changing load strongly point toward a power supply issue. If you can run just fine when using a single GPU (e.g. by blocking one via CUDA_VISIBLE_DEVICES), that would be further confirmation of this working hypothesis. My understanding is that GPU power limits are enforced on the scale of seconds, which is not sufficient to prevent short-duration spikes (see more below). Note that the issue could be due to CPU power draw as well.
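For example, a minimal sketch of restricting a run to one GPU (the script name below is just a placeholder):

    # Expose only one GPU to the process. CUDA_VISIBLE_DEVICES must be set
    # before CUDA is initialized, i.e. before the first torch.cuda call,
    # either in the shell:
    #   CUDA_VISIBLE_DEVICES=0 python train.py
    # or at the very top of the script:
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # only the first GPU is visible

    import torch
    print(torch.cuda.device_count())           # should report 1

If the run is stable with either GPU alone but crashes with both visible, that points squarely at total system power draw rather than a faulty card.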
I asked for the total hardware configuration (CPUs, system memory, mass storage) to get an idea of the total nominal power consumption. By my guess of a typical system, a 1500W PSU should be sufficient for a system with dual RTX 3090s, but that may not actually be the case depending on specifics. It is also possible that there are imbalances in the power supply, e.g. incorrect distribution of loads across PSU rails. You might want to have the system examined by a local expert to find potential power distribution issues; this is not really doable remotely. My rule of thumb is that rock-solid operation requires that total nominal power consumption should not significantly exceed 60% of the nominal power output of the PSU (power supply unit).
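As a rough illustration of that 60% rule (the per-component wattages below are nominal, illustrative figures, not measurements of your system):

    # Rough check against the ~60% rule of thumb.
    # Per-component wattages are nominal/illustrative, not measured.
    components_w = {
        "2x RTX 3090 (350 W TDP each)": 2 * 350,
        "Ryzen 9 3900XT (105 W TDP)": 105,
        "motherboard, RAM, storage, fans (estimate)": 100,
    }
    total_nominal = sum(components_w.values())   # ~905 W

    for psu_w in (1300, 1500, 2000):
        print(f"{psu_w} W PSU: nominal load is {total_nominal / psu_w:.0%} of capacity")

With those figures a 1300W unit is already at about 70% nominal load, a 1500W unit sits right at the 60% mark, and a 2000W unit has comfortable headroom.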
The reason for this large margin is that modern CPUs and GPUs use dynamic clocking and dynamic voltage levels, and as computational loads change, this can cause rapid and massive changes in instantaneous power draw. The corollary is that there will be short-duration (millisecond-range) power spikes that far exceed the TDP (or similar) power ratings associated with these processors. Those ratings are based on power draw averaged over longer time frames on the order of minutes, and are primarily important for the design of cooling solutions (TDP = thermal design power).
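If you want to watch sustained draw while your script runs, a simple sampling loop works, with the caveat that these readings are themselves averaged by the driver and will not show millisecond-scale spikes (sketch assuming nvidia-smi is on PATH):

    # Sample reported GPU power draw once per second while a workload runs.
    # These samples are averaged readings; millisecond spikes will NOT appear.
    import subprocess, time

    for _ in range(30):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,power.draw", "--format=csv,noheader"],
            capture_output=True, text=True,
        ).stdout.strip()
        print(out)
        time.sleep(1)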
Cheaply constructed PSUs in particular have a tendency to be overwhelmed by massive power spikes in high-performance systems. This can happen in two ways: the sudden increase in power draw can cause a temporary voltage drop ("brown-out") on the motherboard, which leads to unstable operation of electronic components (higher switching speeds require higher voltages), so the motherboard negates the "power good" signal, which causes a reboot. The other scenario is that the PSU itself detects a potentially destructive overload and shuts down.
I do not consider a PSU with an 80 PLUS Silver rating (like the one you linked to) suitable for modern HPC/AI systems. My strong minimum recommendation for an HPC/AI workstation is 80 PLUS Gold, preferably 80 PLUS Platinum. My recommendation for an HPC/AI server is 80 PLUS Platinum, preferably 80 PLUS Titanium. These higher-certified PSUs offer higher efficiency, which means more of the power draw at the wall outlet (which can be quite limited in the US due to our 110V AC system) is available to the system. These PSUs typically also offer higher reserves, better component quality and build quality, and longer manufacturer warranties.
One more thought: Make sure there are two separate cables running from the PSU to the dual-8-pin-to-12-pin adapter for the RTX 3090, and that daisy-chaining is not used.
Each 8-pin PCIe auxiliary power cable is designed to supply 150W, and the PCIe slot itself is designed to supply up to 75W.
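To make the arithmetic explicit (spec values only):

    # Nominal power budget per RTX 3090 through its connectors (spec values).
    pcie_8pin_w = 150      # each 8-pin PCIe auxiliary cable
    pcie_slot_w = 75       # the PCIe slot itself

    per_card = 2 * pcie_8pin_w + pcie_slot_w    # 375 W nominal per card
    daisy_chained_cable = 2 * pcie_8pin_w       # 300 W over a single cable run
    print(per_card, daisy_chained_cable)

With two separate cables, each run carries its nominal 150W; daisy-chaining pushes both connectors' load down one cable, which is exactly the kind of concentration that invites trouble under spikes.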
Hi guys, thanks for your answers. FYI: the way I fixed this problem was by using a 2000W PSU. Someone on Reddit recommended this solution, saying the peak draw of a single 3090 can go up to 1000W, so I just replaced the old PSU with a 2000W one and it worked.