RTX 2080 Ti(s) falling off the bus

Hello,

Recently I assembled a rig specced out as follows:
Ubuntu 19.04
AMD Ryzen Threadripper 2950X
128GB DDR4 RAM
3x RTX 2080 Ti (driver version 440)
1300W PSU

After running darknet (with CUDA 10.2 and OpenCV installed) for several minutes, I get the following error message:
“Xid (PCI:0000:08:00): 79, pid=1619, GPU has fallen off the bus.”
Under full load (I’m training a model using darknet), the temperatures stay within 65-85C. Power draw also doesn’t exceed 270W per card at peak, and it rarely gets that high anyway.
If there’s any other info I can provide, please let me know!

Here is the nvidia bug log: https://we.tl/t-QXUc16pKuI
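When scanning a long kernel log for messages like the one quoted above, it can help to pull out the PCI address and Xid code automatically. The following is a minimal sketch, assuming the log lines follow the exact shape quoted in this thread; the regular expression and the `parse_xid` helper are illustrative names, not part of any NVIDIA tool, and other driver versions may format the line differently.

```python
import re

# Pattern modeled on the single Xid line quoted in this thread (assumption:
# other lines in the log follow the same layout).
XID_RE = re.compile(
    r"Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<code>\d+)"
    r"(?:, pid=(?P<pid>\d+))?, (?P<msg>.*)"
)

def parse_xid(line):
    """Return bus address, Xid code, pid and message, or None if no match."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return {
        "bus": m.group("bus"),
        "code": int(m.group("code")),
        "pid": int(m.group("pid")) if m.group("pid") else None,
        "message": m.group("msg").strip(),
    }

sample = "Xid (PCI:0000:08:00): 79, pid=1619, GPU has fallen off the bus."
print(parse_xid(sample))
```

Feeding each line of the kernel log through `parse_xid` and counting results per `bus` value would show whether it is always the same card that drops out.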

Looks like a lack of power (Xid 79).

So I just went through the numbers and they total about 1077W:

CPU: 180W
GPU: 3x 270W = 810W
RAM: ~3W/8GB = 48W
SSD: ~3W/disk = 6W
Fans: 5x 3W = 15W
Water cooler = 18W

The EVGA PSU is rated at 1300W and it’s Gold rated, so at 100% load it’s 87% efficient, which takes us down to 1131W. That’s kind of cutting it close.
Do you think that’s close enough to cut the GPUs off like that? Even though I’m running the GPUs at full power, the CPU doesn’t do much work, so we do have a ~150W buffer. That being said, it does run fine when only two of the cards are active, so it may very well be the PSU.
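For anyone who wants to redo this estimate with their own parts, the arithmetic above can be written out as a small script. This is only a sketch using the per-component figures from this post; the two-disk count is an assumption inferred from the 6W SSD total, and the 87% derating follows the post's reasoning. PSU wattage ratings normally describe DC output, with efficiency describing the extra draw at the wall, so the derated figure is best read as a conservative margin.

```python
# Per-component estimates from the post (watts).
loads = {
    "cpu": 180,
    "gpus": 3 * 270,          # three cards at the observed 270W peak
    "ram": (128 // 8) * 3,    # ~3W per 8GB of the 128GB installed
    "ssds": 2 * 3,            # assumption: 6W total implies two disks
    "fans": 5 * 3,
    "water_cooler": 18,
}

total = sum(loads.values())

psu_rated = 1300
# Derating by 87% as in the post; integer math avoids float rounding noise.
derated = psu_rated * 87 // 100
headroom = derated - total

print(f"estimated load:   {total}W")
print(f"derated capacity: {derated}W")
print(f"headroom:         {headroom}W")
```

With these numbers the steady-state headroom is small, and it does not account for short transients above the 270W per-card figure.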

Also, I ran gpuburn for 20 minutes and all of the cards stayed online the entire time.

Sorry for being so thorough, I just want to be sure before I make a recommendation to my team (it’s very corporate, so it’s a pain to change things).

The problem with RTX cards is power spikes: the 2080 Ti easily reaches 360W or more for a very short time when going into boost, which can take down even PSUs with an otherwise more than sufficient power budget. This is very hard to calculate, especially with three consumer RTX cards. Extreme example: https://devtalk.nvidia.com/default/topic/1049249/linux/quadro-rtx-6000-causes-hpe-server-to-power-off-peaks-way-over-power-limit-/post/5325050/#5325050
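To get a rough feel for what such spikes do to the budget, the earlier estimate can be repeated with a per-card transient figure in place of the sustained 270W. This is a worst-case sketch that assumes all three cards spike at the same instant, which may not happen in practice; the non-GPU figures come from the estimate earlier in the thread, and `peak_total` is just an illustrative helper name.

```python
# Non-GPU load from the earlier estimate (watts):
# CPU 180 + RAM 48 + SSDs 6 + fans 15 + water cooler 18.
NON_GPU_W = 180 + 48 + 6 + 15 + 18
NUM_GPUS = 3
PSU_RATED_W = 1300

def peak_total(gpu_peak_w):
    """Total system load if every GPU hits gpu_peak_w simultaneously."""
    return NON_GPU_W + NUM_GPUS * gpu_peak_w

for gpu_peak_w in (270, 360):
    total = peak_total(gpu_peak_w)
    verdict = "over" if total > PSU_RATED_W else "within"
    print(f"{gpu_peak_w}W per card -> {total}W total ({verdict} the {PSU_RATED_W}W rating)")
```

At 360W per card the simultaneous-spike total already exceeds the 1300W nameplate, even though the sustained estimate fits comfortably.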

Gotcha, that’s very unfortunate haha. I appreciate the help though, wouldn’t have figured that out myself!

Btw, are there any PSUs that you’d recommend? I’m looking at the CORSAIR AXi Series AX1600i CP-9020087-NA 1600W Digital ATX Power Supply (on Newegg.com), but using my previous calculations with 400W max per GPU instead, the 1600W Titanium puts out 1440W at 100% load and the max load is now 1467W. Like I said, the CPU doesn’t run at 100% load, but that’s still a bit tight.

I wouldn’t recommend those because, afaik, they are digitally configurable PSUs that run in multi-rail mode by default and need the Corsair software to switch them to single-rail mode. I wouldn’t count on that software being available for Linux. Without single-rail you’ll run into the same problem.

Hmm okay, so the better option is to just get a second PSU then. Can you point me to an article about that? I read the post that you sent about the Quadros and the 6kW PSU with n+1, but I’m not entirely sure how to set that up.

That was specific to that HPE server; it had four PSUs, each of which can be set in the BIOS as either online or fail-over spare.
I can’t really wholeheartedly recommend any PSU; it’s not my field of knowledge. I only know about some of the pitfalls.