I recently ran into an issue with GPU utilization under Docker. I launched my container with "docker run --gpus all", but when I executed my Python training code, only one GPU was being used, even though my server has 8 P100 GPUs. The remaining GPUs sat idle, as shown in the image below:
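As a first sanity check, it may help to confirm which GPUs the container is actually allowed to see. This is a small stdlib-only sketch (the helper name `visible_gpu_ids` is my own, not from any library) that reads the environment variables the NVIDIA Container Toolkit and CUDA respect; `--gpus all` should leave no restriction in place:

```python
import os

def visible_gpu_ids():
    """Return the GPU ids this process may use, or None if unrestricted.

    `docker run --gpus all` exposes every device (NVIDIA_VISIBLE_DEVICES=all
    inside the container), while a restricted run such as
    `--gpus '"device=0,1"'` records an explicit id list.
    CUDA_VISIBLE_DEVICES, if set, further narrows what CUDA apps can use.
    """
    cuda = os.environ.get("CUDA_VISIBLE_DEVICES")
    if cuda is not None:
        # CUDA-level restriction takes precedence for CUDA applications.
        return [d for d in cuda.split(",") if d]
    nv = os.environ.get("NVIDIA_VISIBLE_DEVICES", "")
    if nv in ("", "all"):
        return None  # no restriction recorded; nvidia-smi shows the real count
    return [d for d in nv.split(",") if d]
```

If this returns None (or all 8 ids) inside the container but training still uses one GPU, the limitation is in the training code rather than in the Docker invocation.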
My assumption was that in a typical setup, work would be distributed across the GPUs automatically, thanks to NVLink, without any additional code on my side. Is that understanding correct?
The limited GPU usage has significantly slowed down my training process, and I’m eager to resolve this issue. As a side note, my system originally ran DGX OS 3.1.2. To enable the use of CUDA 12.1, I performed a fresh installation of DGX OS 5.4 and subsequently upgraded to DGX OS 6.1. It’s possible that this upgrade has contributed to the issue I’m facing.
In addition, I ran tests with the "PyTorch | NVIDIA NGC" image, using mnist main.py. To my surprise, a single iteration took more than 10 seconds, and one epoch took several minutes to complete. This is notably slower than an RTX 3070 and lags far behind the DGX-2 T4 platform.
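For reference, I measured per-iteration time roughly as follows. This is a generic stdlib-only sketch rather than the exact code in mnist main.py; `step` stands in for whatever one training iteration does:

```python
import time

def mean_seconds_per_iteration(step, n=10):
    """Run `step` n times and return the average wall-clock seconds per call."""
    start = time.perf_counter()
    for _ in range(n):
        step()
    return (time.perf_counter() - start) / n

# Example with a stand-in workload instead of a real training step:
avg = mean_seconds_per_iteration(lambda: sum(range(100_000)), n=5)
print(f"{avg:.6f} s/iter")
```

On this box the same kind of measurement around the real training step is what produced the 10+ seconds per iteration figure above.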
I'm keen to determine whether my system is misconfigured or whether some other underlying issue is at play. Your insights and guidance would be greatly appreciated. Thank you in advance for your assistance.