I am training deep learning models on a computer with two NVIDIA RTX 2080 Ti, and I am facing the following problem.
When I only start one process on any of the GPUs and the other GPU remains idle, the process works at full speed.
However, when I start a process on GPU:0 and then a different process on GPU:1, GPU:1 starts losing power and its process slows down to a ridiculous speed.
Has anyone experienced something similar? Could it be a driver configuration problem, or a hardware problem (motherboard, data bus performance, power)? I am running Ubuntu 20.04 with NVIDIA driver version 450.66 and CUDA 10.1.
The computer specs are:
- INTEL CORE i9 10900X
- MOTHERBOARD AORUS X299 UD4 PRO
- 4 x DDR4 16 GB 3600 MHz. HyperX FURY RGB BLACK
- LIQUID COOLING SYSTEM H100i PRO CORSAIR
- 2 NVIDIA RTX 2080 Ti XTREME 11 GB GIGABYTE
- POWER SUPPLY 1200W
Thank you for your help.
Based on your system specs, a 1200W power supply is sufficient to run your system stably under full load. Are the PCIe auxiliary power cables for the GPUs hooked up correctly? No splitters, daisy-chaining, or converters are used in the GPU auxiliary power supply, correct?
Intel Core i9 10900X has 48 PCIe lanes, so each GPU should be on a PCIe gen3 x16 connector, provided they are plugged into the correct PCIe slots (check motherboard documentation).
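One way to confirm the slot assignment from software is to read the negotiated PCIe link of each GPU. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names (`parse_csv_rows`, `query_pcie_links`, `narrow_links`) are made up for illustration, and the query field names can be double-checked with `nvidia-smi --help-query-gpu`. The link generation may drop while a GPU is idle, so it is best read while training is running.

```python
import subprocess

# Field names as accepted by `nvidia-smi --query-gpu=...`
PCIE_FIELDS = [
    "index",
    "name",
    "pcie.link.gen.current",
    "pcie.link.width.current",
    "pcie.link.width.max",
]


def parse_csv_rows(text, fields):
    """Turn `--format=csv,noheader` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        rows.append(dict(zip(fields, values)))
    return rows


def query_pcie_links():
    """Ask nvidia-smi for the current PCIe link of every GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(PCIE_FIELDS),
        "--format=csv,noheader",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_csv_rows(out, PCIE_FIELDS)


def narrow_links(rows, expected_width=16):
    """Return the GPUs whose negotiated link is narrower than expected."""
    return [r for r in rows if int(r["pcie.link.width.current"]) < expected_width]
```

Calling `narrow_links(query_pcie_links())` returns an empty list when both cards report an x16 link; any entry in the result points at a GPU worth moving to another slot.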
While deep learning is running (make sure it runs for a while so you can observe “steady state”), use nvidia-smi to check for power draw and temperature of the GPUs, as well as any thermal throttling or power throttling events. Make sure your liquid cooling system is running within specs for water temperature etc., and monitor the CPU temperature.
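The steady-state check can be scripted so both GPUs are sampled side by side. This is only a sketch under assumptions: the helper names, the 5-second interval, and the choice of fields are illustrative, and the throttle-reason field names should be verified against `nvidia-smi --help-query-gpu` for the installed driver.

```python
import subprocess
import time

# Query fields; the throttle reasons are reported as "Active" / "Not Active".
FIELDS = [
    "index",
    "power.draw",
    "power.limit",
    "temperature.gpu",
    "clocks.sm",
    "utilization.gpu",
    "clocks_throttle_reasons.hw_slowdown",
    "clocks_throttle_reasons.sw_power_cap",
    "clocks_throttle_reasons.sw_thermal_slowdown",
]


def parse_rows(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        rows.append(dict(zip(FIELDS, values)))
    return rows


def active_throttles(row):
    """List the throttle-reason fields that are currently active for a GPU."""
    return [
        f for f in FIELDS
        if f.startswith("clocks_throttle_reasons.") and row.get(f) == "Active"
    ]


def poll(interval_s=5.0, samples=12):
    """Print power, temperature, SM clock and throttle reasons periodically."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    for _ in range(samples):
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        for row in parse_rows(out):
            print(
                f"GPU {row['index']}: {row['power.draw']} W of {row['power.limit']} W, "
                f"{row['temperature.gpu']} C, SM {row['clocks.sm']} MHz, "
                f"throttling: {active_throttles(row) or 'none'}"
            )
        time.sleep(interval_s)
```

Running `poll()` once with a single training process and once with both processes active makes it easier to see whether GPU:1's power draw or SM clock falls only in the two-process case, and which throttle reason, if any, is reported at that moment.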
If you don’t see anything out of the ordinary on the above checklist, a hardware-related issue seems unlikely, but there may be a software configuration issue. I am assuming that while deep learning is running, you are not overloading the CPU with too many active threads. Your CPU has 10 cores supporting 20 threads via hyperthreading, so you would want to keep the number of active long-running threads below that limit.
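A rough way to test the thread assumption is to compare the system load average with the number of logical CPUs while both training processes are running. The sketch below uses only the Python standard library; the function names are invented for illustration, and the Linux load average also counts tasks blocked on I/O, so it is an approximation rather than an exact count of runnable threads.

```python
import os


def cpu_headroom(load_avg, logical_cpus):
    """Logical CPUs left over after subtracting the load average."""
    return logical_cpus - load_avg


def is_oversubscribed(load_avg, logical_cpus):
    """True when more tasks are competing for the CPU than it has threads."""
    return cpu_headroom(load_avg, logical_cpus) < 0


def report():
    """Print the 1-minute load average against the logical CPU count."""
    logical = os.cpu_count()           # 20 on a 10-core CPU with hyperthreading
    one_min, _, _ = os.getloadavg()    # available on Linux and other Unix systems
    state = "oversubscribed" if is_oversubscribed(one_min, logical) else "ok"
    print(f"load {one_min:.1f} on {logical} logical CPUs: {state}")
```

If `report()` shows the load staying above the logical CPU count once the second process starts, reducing the number of data-loading workers per process would be a reasonable first experiment.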