GPU lost when training with darknet

Hello.

I recently built a deep learning server.
My environment is listed below.
CPU : Ryzen 2970wx
RAM : 64G
GPU : 2080ti X 4
Power : 1700watts
OS : Ubuntu 18.04
driver : 418
cuda : 10.1

I tested many combinations:
1 gpu train : ok
2 gpu train : ok
3 gpu train : ok
1 x 1 ~ 2 gpu : ok

Whenever I use all four GPUs, I get an error: GPU LOST.

How can I fix this problem?

nvidia-bug-report.log.gz (2.39 MB)

Might be a power supply issue. Yes, I can read that you said power 1700 watts, whatever that means. I’m not going to argue about it.

As Robert Crovella said, this is very likely an issue with insufficient power supply. Based on your stated specifications, the power draw of the host platform is around 300W, and each 2080 Ti accounts for at least 250W (some models could draw up to 270W, depending on vendor). The maximum continuous load (~= thermal design power or TDP) of the system is therefore around 1300W at minimum.
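The estimate above can be reproduced with a few lines of arithmetic. This is only a rough sketch: the 300 W host figure and the 250 W to 270 W per-card range are the numbers quoted in this post, and the function name is made up for illustration.

```python
def sustained_load_watts(host_watts, gpu_watts, gpu_count):
    """Sum the nominal (TDP-like) power draw of the host and all GPUs."""
    return host_watts + gpu_watts * gpu_count


# Assumed inputs taken from the post: ~300 W host, 4 GPUs at 250 W to 270 W each.
low = sustained_load_watts(host_watts=300, gpu_watts=250, gpu_count=4)
high = sustained_load_watts(host_watts=300, gpu_watts=270, gpu_count=4)

print(f"Estimated sustained load: {low} W to {high} W")
```

The lower bound is the 1300W figure used in the rest of this thread.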

I am not aware of power supply units spec’ed at 1700W. For the American market, 1600W power supplies typically represent the high end for individual PSUs, based on current limits for 120V power outlets. Your system may use multiple power supplies, such as dual 850W units.

My rule of thumb for rock-solid operation is that the sum of nominal power draw of all system components should not exceed 60% of the nominal power rating of the PSU. This is based on long-term experience, and the fact that the instantaneous power draw of CPUs and GPUs often exceeds stated long-term design power by 20% to 25%. These are usually short-term power peaks in the double-digit millisecond range. Based on this rule of thumb, you would need a PSU capable of delivering 2166W. You may be able to get away with a 2000W power supply solution.
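As a quick illustration of that rule of thumb, here is a minimal sketch of the arithmetic. The 1300W load, the 60% ceiling, and the 20% to 25% overshoot are the figures from this thread; the function names are hypothetical and the model ignores PSU efficiency and per-rail limits.

```python
NOMINAL_LOAD_W = 1300  # assumed: ~300 W host + 4 x 250 W GPUs, as estimated earlier


def required_psu_watts(nominal_load_w, max_utilization=0.60):
    """PSU rating needed so the nominal load stays at or below max_utilization."""
    return nominal_load_w / max_utilization


def peak_load_watts(nominal_load_w, overshoot):
    """Short-term peak draw, given a fractional overshoot above nominal load."""
    return nominal_load_w * (1 + overshoot)


required = required_psu_watts(NOMINAL_LOAD_W)
peak_low = peak_load_watts(NOMINAL_LOAD_W, 0.20)
peak_high = peak_load_watts(NOMINAL_LOAD_W, 0.25)

print(f"PSU rating for 60% utilization: about {required:.0f} W")
print(f"Short-term peaks: about {peak_low:.0f} W to {peak_high:.0f} W")
```

With these inputs the short-term peaks land at roughly 1560W to 1625W, which leaves very little margin on a PSU rated at 1700W.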

For best results (longevity, quality of components, efficiency -> electricity bill), I would suggest a PSU compliant with 80 PLUS Platinum specifications.

Make sure that each GPU is supplied by its own PCIe power cable (no Y-splitters, no daisy chains, no converters).

I live in South Korea.
My PSU is an ENERMAX EPM1700EGT. This product is 90Plus ready, so I did not suspect a PSU problem.
However, I will set up a dual PSU configuration to test this case.

Beware! That 90-something label is a vendor self-designation that means pretty much nothing. A handy list of 80 PLUS certified PSUs can be found here: https://www.plugloadsolutions.com/80PlusPowerSupplies.aspx. For what it’s worth, I don’t see a 1700W model among the Enermax models listed there. A British site that sells your PSU model says: “Enermax EPM1700EGT, 80 PLUS Platinum”.

My general recommendation for PSUs is to pick an 80 PLUS Platinum certified model for high-end compute workstations, and an 80 PLUS Titanium (the highest available designation) certified model for high-end compute servers. Where electrical power is cheap and PSU price is an issue, one could drop that by one level (Gold for workstations, Platinum for servers) as an absolute minimum.

Right, I mistyped 80 as 90.

Thank you for your advice. I’ll set up a dual PSU configuration and report the result.

I set up a dual PSU configuration with the Enermax 1700W unit (I think this product sucks) and a Micronics 500W unit.

I then tried deep learning with Darknet YOLOv3, and the GPU is no longer lost.

Thanks, everybody.

Keep in mind that depending on where you live, counterfeit PSUs may be a thing. I am aware of counterfeit PSU components, counterfeit CPUs, and counterfeit GPUs in various Asian markets. In this forum we have had a half-dozen reports of counterfeit GPUs over the years, as I recall.