Training a GAN on the RTX 3070 with only ~5% GPU utilization

Hello, I am training a GAN on 15k binary images of size 256×256.

I am using PyTorch 1.7.1, CUDA 11 and AMP (automatic mixed precision) in the code. The estimated training time is ~9 h on the RTX 3070.

But why is GPU utilization below 10% while my CPU is at 70%?

The dedicated GPU memory usage of 5.9/8 GB looks plausible, though.

On an older PC with a GTX 1080 Ti, training takes 18 h, but with lower CPU utilization.

It stands to reason that, all other things being equal, if you do the same amount of work in less time on the GPU, the corresponding CPU utilization will go up.
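To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes, hypothetically, that the CPU-side work of the whole run (data loading, Python and framework overhead, kernel launches) costs the same 6 CPU-hours on both machines; the 18 h and 9 h wall-clock times are the ones quoted above, but the 6 h figure is made up for illustration, not measured.

```python
def cpu_utilization(cpu_busy_hours: float, wall_clock_hours: float) -> float:
    """Fraction of wall-clock time during which the CPU is busy."""
    if wall_clock_hours <= 0:
        raise ValueError("wall-clock time must be positive")
    # The CPU cannot be more than 100% busy.
    return min(1.0, cpu_busy_hours / wall_clock_hours)


# Hypothetical: identical CPU-side work on both machines; only GPU time shrinks.
CPU_BUSY_HOURS = 6.0  # assumed for illustration, not measured

old = cpu_utilization(CPU_BUSY_HOURS, 18.0)  # GTX 1080 Ti run time from the post
new = cpu_utilization(CPU_BUSY_HOURS, 9.0)   # RTX 3070 estimate from the post
print(f"GTX 1080 Ti: {old:.0%}  RTX 3070: {new:.0%}")
```

Halving the wall-clock time with the same amount of CPU work doubles the CPU utilization. The real numbers will differ, since the two PCs also have different CPUs.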

Use nvidia-smi instead (run nvidia-smi from a command prompt).
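If you want the numbers in a script, a sketch like the following wraps nvidia-smi's query mode. It assumes nvidia-smi is on the PATH. The fields utilization.gpu, memory.used and memory.total and the csv,noheader,nounits format are standard nvidia-smi query options; the helper function names are just illustrative. Keep in mind that utilization.gpu is the percentage of the sample period during which at least one kernel was running, so it does not tell you how fully each kernel occupies the GPU.

```python
import subprocess
from typing import Dict, List

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_nvidia_smi_csv(text: str) -> List[Dict[str, float]]:
    """Parse 'csv,noheader,nounits' output: one line per GPU, memory in MiB."""
    gpus = []
    for line in text.strip().splitlines():
        # float() tolerates the spaces nvidia-smi puts after each comma.
        util, used, total = (float(field) for field in line.split(","))
        gpus.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return gpus


def query_gpus() -> List[Dict[str, float]]:
    """Run nvidia-smi once and return one dict per GPU (hypothetical helper)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_nvidia_smi_csv(out)
```

Call query_gpus() periodically while training runs, or add -l 1 to the nvidia-smi command line to have it repeat the query every second.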

While nvidia-smi is OK for utilization monitoring, for continuous monitoring on Windows I would recommend TechPowerUp’s GPU-Z, which is a free download.

Robert Crovella’s hypothesis seems plausible to me. You might also want to check with the PyTorch developers or inspect the source code, and run with the CUDA profiler. I suspect a contributing factor may be the small size of the individual images, which could lead to extremely short kernel run times and therefore greater exposure to kernel launch overhead.
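A toy model shows why very short kernels can leave the GPU idle while the CPU is busy. It assumes launches are asynchronous, so in steady state each step takes as long as the slower of two things: the kernel itself, or the CPU-side cost of issuing it. The 10 µs host cost per launch and the kernel durations are hypothetical values chosen for illustration, not measurements from PyTorch or this GAN.

```python
def gpu_busy_fraction(kernel_us: float, host_us_per_launch: float) -> float:
    """Steady-state fraction of time the GPU spends executing kernels.

    Idealized model: the CPU enqueues kernels back to back, so each step
    takes max(kernel time, host time per launch). If the host side is the
    slower of the two, the GPU sits idle between kernels.
    """
    period = max(kernel_us, host_us_per_launch)
    return kernel_us / period


HOST_US = 10.0  # hypothetical per-launch CPU cost (framework overhead + launch)
for kernel_us in (2.0, 5.0, 10.0, 100.0):  # hypothetical kernel durations
    busy = gpu_busy_fraction(kernel_us, HOST_US)
    print(f"kernel {kernel_us:6.1f} us -> GPU busy {busy:.0%}")
```

When kernels are shorter than the host-side cost, the CPU becomes the bottleneck. It stays busy while the GPU reads as mostly idle, which would match the symptoms in the original post. A profiler timeline would show whether this is actually what is happening.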

Historically, poor GPU utilization has been observed in software that, when first created, tried to balance work between the CPU and the GPU. A decade later, as GPU performance had increased much more rapidly than CPU performance, such software often became bottlenecked on the CPU portion of the code. Other software that focused on maximizing the amount of work done on the GPU from the start (even when some of that code ran rather inefficiently on the GPU) scaled better in the long run. I don’t have experience with PyTorch, so I cannot say which of these categories it falls into.

Even software that is aggressively parallelized with CUDA contains serial portions that run on the CPU. HPC systems with high-end GPUs should therefore use CPUs with high single-thread performance to keep the serial portions of the application from becoming a bottleneck. To first order, this means using CPUs with a high base clock: my standing recommendation is to choose CPUs with a base clock > 3.5 GHz.
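This argument is essentially Amdahl's law. The sketch below extends it with a separate speedup for the serial (CPU) part, with the original run time normalized to 1. The 10% serial fraction and the 1.3x CPU speedup are hypothetical values picked to illustrate the trend.

```python
def overall_speedup(serial_fraction: float, gpu_speedup: float,
                    cpu_speedup: float = 1.0) -> float:
    """Amdahl's law with separate speedups for the serial and parallel parts.

    serial_fraction: share of the original run time spent in serial CPU code.
    gpu_speedup:     how much faster the parallel (GPU) part becomes.
    cpu_speedup:     how much faster the serial (CPU) part becomes.
    """
    new_time = serial_fraction / cpu_speedup + (1.0 - serial_fraction) / gpu_speedup
    return 1.0 / new_time


SERIAL = 0.10  # hypothetical: 10% of the original run time is serial CPU work
for gpu_x in (10, 100, 1000):
    base = overall_speedup(SERIAL, gpu_x)
    fast_cpu = overall_speedup(SERIAL, gpu_x, cpu_speedup=1.3)
    print(f"GPU {gpu_x:>4}x: overall {base:5.2f}x, "
          f"with 1.3x faster CPU {fast_cpu:5.2f}x")
```

As the GPU speedup grows, the overall speedup approaches 1 / serial_fraction (10x here) unless the serial part also gets faster, which is why single-thread CPU performance matters more and more as GPUs get faster.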
