nvidia-smi is OK for utilization monitoring. For continuous monitoring on Windows, I would recommend TechPowerUp’s GPU-Z, which is a free download.
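If you would rather log utilization from a script than watch a GUI, something along these lines may work. It is only a sketch: it assumes `nvidia-smi` is on the PATH and that the GPU reports plain integer percentages (some GPUs report `[N/A]`, which this does not handle). The helper names `parse_utilization` and `sample_once` are my own, not part of any NVIDIA tool.

```python
import subprocess

# Query GPU and memory-controller utilization as bare CSV numbers, one line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,utilization.memory",
    "--format=csv,noheader,nounits",
]


def parse_utilization(csv_text):
    """Turn lines like '35, 12' into (gpu_percent, mem_percent) tuples, one per GPU."""
    samples = []
    for line in csv_text.strip().splitlines():
        gpu, mem = (int(field.strip()) for field in line.split(","))
        samples.append((gpu, mem))
    return samples


def sample_once():
    """Run nvidia-smi once and return the parsed utilization (requires an NVIDIA GPU)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_utilization(result.stdout)
```

Calling `sample_once()` in a loop with a one-second sleep while the training job runs gives a rough utilization trace.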
Robert Crovella’s hypothesis seems plausible to me. You might also want to check with the makers of PyTorch or inspect the source code, and run with the CUDA profiler. I suspect a contributing factor may be the small size of the individual images, which could lead to extremely short kernel run times and therefore increased exposure to kernel launch overhead.
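To get a feel for how much short kernels can hurt, here is a back-of-the-envelope model. It assumes every kernel pays a fixed launch overhead that is not overlapped with execution, and the default of 5 microseconds is an assumed ballpark figure, not a measurement of your system. The function name `busy_fraction` is made up for illustration.

```python
def busy_fraction(kernel_us, launch_overhead_us=5.0):
    """Fraction of wall-clock time the GPU spends executing kernels.

    Simplified model: each kernel runs for kernel_us microseconds and is
    preceded by launch_overhead_us microseconds of idle time (an assumed,
    non-overlapped cost).
    """
    return kernel_us / (kernel_us + launch_overhead_us)


if __name__ == "__main__":
    for kernel_us in (2.0, 5.0, 20.0, 100.0, 1000.0):
        print(f"{kernel_us:7.1f} us kernel -> {busy_fraction(kernel_us):.0%} busy")
```

Under these assumptions, a kernel that runs only as long as its launch overhead keeps the GPU busy half the time at best, which is why the profiler timeline is worth a look.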
Historically, poor GPU utilization has been observed in software that tried to balance CPU and GPU work when it was first created. A decade later, as GPU performance had increased much more rapidly than CPU performance, such software often became bottlenecked on the CPU portion of the code. Other software that focused on maximizing the amount of work done on the GPU from the start (even when the efficiency of some of the code when running on the GPU was rather poor) scaled better in the long run. I don’t have experience with PyTorch, so I cannot say which of these categories it falls into.
Even software that is aggressively parallelized with CUDA contains serial portions that run on the CPU. HPC systems with high-end GPUs should therefore utilize CPUs with high single-thread performance to avoid the serial portions of the application becoming a bottleneck. To first order this means using CPUs with a high base clock: my standing recommendation is to choose CPUs with base clock > 3.5 GHz.
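The effect of the serial CPU portion can be estimated with Amdahl's law. The sketch below is generic arithmetic, not anything specific to PyTorch; the function name `overall_speedup` and the example fractions are illustrative assumptions.

```python
def overall_speedup(gpu_fraction, gpu_speedup):
    """Amdahl's law: whole-application speedup when only part of the work is accelerated.

    gpu_fraction: share of the original run time that is offloaded to the GPU (0..1).
    gpu_speedup:  how much faster that share runs on the GPU.
    The remaining (1 - gpu_fraction) stays serial on the CPU.
    """
    return 1.0 / ((1.0 - gpu_fraction) + gpu_fraction / gpu_speedup)


if __name__ == "__main__":
    # With 90% of the work on the GPU, the CPU part caps the speedup at 10x,
    # no matter how fast the GPU becomes.
    for gpu_speedup in (10.0, 100.0, 1000.0):
        print(f"GPU part {gpu_speedup:6.0f}x faster -> "
              f"{overall_speedup(0.9, gpu_speedup):.2f}x overall")
```

As the GPU portion gets faster, the overall speedup flattens out at 1 / (1 - gpu_fraction), so from then on only a faster CPU (or moving more work to the GPU) helps.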