Recently I assembled a rig specced out as follows:
Ubuntu 19.04
AMD Ryzen Threadripper 2950X
128GB DDR4 RAM
3x RTX 2080 Ti (driver version 440)
1300W PSU
After running darknet (with CUDA 10.2 and OpenCV installed) for several minutes, I get the following error message:
“Xid (PCI:0000:08:00): 79, pid=1619, GPU has fallen off the bus.”
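To see how often this happens and on which card, the Xid lines can be pulled out of the kernel log with a small script. This is only a minimal sketch: the names `parse_xid` and `XidEvent` are made up for illustration, and the regular expression is modeled solely on the one message quoted above, so other Xid messages may need a different pattern.

```python
import re
from typing import NamedTuple, Optional

# Pattern modeled on the single message quoted in this thread.
# This is an assumption, not an official description of the log format.
XID_PATTERN = re.compile(
    r"Xid \((?P<pci>PCI:[0-9A-Fa-f:.]+)\): (?P<code>\d+)"
    r"(?:, pid=(?P<pid>\d+))?, (?P<message>.*)"
)


class XidEvent(NamedTuple):
    pci: str            # PCI address of the affected GPU
    code: int           # Xid error code (79 in the message above)
    pid: Optional[int]  # process id, if the line carries one
    message: str        # free-text description


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the Xid event found in one log line, or None if there is none."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    pid = int(match.group("pid")) if match.group("pid") else None
    return XidEvent(
        pci=match.group("pci"),
        code=int(match.group("code")),
        pid=pid,
        message=match.group("message").strip(),
    )
```

Feeding it each line of `dmesg` output and counting events per `pci` value would show whether the same card drops every time or whether it moves around.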
Under full load (I’m training a model using darknet), the temperatures stay within 65-85C. Power draw also doesn’t exceed 270W per card at peak, and it rarely gets that high anyway.
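For logging these numbers during a training run, something like the sketch below could work. It relies on `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`; the helper names (`parse_readings`, `read_gpus`, `total_power`) are hypothetical, and the sketch assumes every field comes back as a plain number (a card reporting `[N/A]` would need extra handling).

```python
import subprocess
from typing import List, Tuple

# Ask nvidia-smi for index, temperature (C) and power draw (W) per GPU.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]

Reading = Tuple[int, float, float]  # (gpu index, temperature in C, power in W)


def parse_readings(csv_text: str) -> List[Reading]:
    """Parse the CSV text produced by QUERY_CMD into numeric tuples."""
    readings = []
    for row in csv_text.strip().splitlines():
        index, temp_c, power_w = (field.strip() for field in row.split(","))
        readings.append((int(index), float(temp_c), float(power_w)))
    return readings


def read_gpus() -> List[Reading]:
    """Run nvidia-smi once and return the current readings."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_readings(result.stdout)


def total_power(readings: List[Reading]) -> float:
    """Sum the reported power draw across all GPUs, in watts."""
    return sum(power_w for _, _, power_w in readings)
```

Calling `read_gpus()` in a loop with a short sleep gives a rough power trace. Keep in mind that these are sampled values, so very short spikes may not show up in them.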
If there’s any other info I can provide, please let me know!
The EVGA PSU is rated at 1300W and it’s 80 Plus Gold certified, so at 100% load it’s 87% efficient, which takes us down to 1131W. That is cutting it kind of close.
Do you think that’s close enough to cut the GPUs off like that? Even though I’m running the GPUs at full power, the CPU doesn’t do much work, so we do have a ~150W buffer. That said, it does run fine when only two of the cards are active, so it may very well be the PSU.
Also, I ran gpuburn for 20 minutes and all of the cards stayed online the entire time.
Sorry for being so thorough. I just want to be sure before I make a recommendation to my team (it’s very corporate, so it’s a pain to change things).
Btw, are there any PSUs that you recommend? I’m looking at this: CORSAIR AXi Series AX1600i CP-9020087-NA 1600W Digital ATX Power Supply - Newegg.com. Using my previous calculation with 400W max per GPU instead, the 1600W Titanium unit puts out 1440W at 100% load, and the max load is now 1467W. Like I said, the CPU doesn’t run at 100% load, but that’s still a bit tight.
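The arithmetic in the posts above can be written out as a short sketch. The function names are made up for illustration, and the 90% figure is simply what 1440W out of 1600W implies. The sketch follows the method used in the posts, which multiplies the rated wattage by the efficiency. As far as I know, the wattage on the label usually refers to DC output, with efficiency describing how much is drawn from the wall, so treat this as a conservative estimate rather than a precise budget.

```python
def usable_watts(rated_watts: int, efficiency_pct: int) -> float:
    """Rated wattage scaled by efficiency, as done in the posts above."""
    return rated_watts * efficiency_pct / 100


def headroom(rated_watts: int, efficiency_pct: int, load_watts: float) -> float:
    """Watts left over (negative means the estimated load exceeds the budget)."""
    return usable_watts(rated_watts, efficiency_pct) - load_watts


# Figures taken from the thread.
gold_budget = usable_watts(1300, 87)          # current 1300W Gold unit
titanium_budget = usable_watts(1600, 90)      # 1600W Titanium unit
titanium_headroom = headroom(1600, 90, 1467)  # against the 1467W max load
observed_gpu_peak = 3 * 270                   # three cards at the observed 270W peak

print(gold_budget, titanium_budget, titanium_headroom, observed_gpu_peak)
```

With these inputs the 1600W unit ends up 27W short of the worst-case 1467W figure, which matches the "a bit tight" conclusion, while the observed GPU peak of 810W sits well under the 1131W computed for the current unit.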
I wouldn’t recommend those, because as far as I know they are digitally configurable PSUs that run in multi-rail mode by default and need the Corsair software to switch them to single-rail mode. I wouldn’t count on that software being available for Linux. Without single-rail mode you’ll run into the same problem.
Hmm okay, so the better option is to just get a second PSU then. Can you point me to an article about that? I read the post that you sent about the Quadros and the 6kW PSU with N+1, but I’m not entirely sure how to set that up.
That was specific to that HPE server; it had four PSUs, which can be set in the BIOS as either online or fail-over spare.
I can’t really wholeheartedly recommend any PSU; that’s not my field of knowledge. I only know about some of the pitfalls.