DGX-1 multi-GPU problem

asdewq45445 · October 26, 2023, 8:54am

Hello everyone,

I recently encountered an issue while working with Docker and GPU utilization. Specifically, I used the “docker run --gpus all” command to launch my container, but when I executed my Python training code, I noticed that only one GPU was being utilized, even though my server has 8 P100 GPUs. The rest of the GPUs appeared to be idle, as shown in the image below:

In a typical setup, GPU allocation should happen automatically, thanks to NVLink, without the need for additional code. Is my understanding correct in this regard?
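For context on the question above: NVLink is an interconnect between GPUs and does not spread a training job across them by itself. A PyTorch script normally runs on a single GPU unless the code asks for more, for example through `torch.nn.DataParallel` or `DistributedDataParallel`. The sketch below is a minimal illustration of the `DataParallel` route, not the code from the original post. The helper name `wrap_for_all_gpus` and the tiny `nn.Linear` model are hypothetical stand-ins.

```python
import torch
import torch.nn as nn


def wrap_for_all_gpus(model: nn.Module) -> nn.Module:
    """Move a model to the GPU and replicate it across all visible GPUs.

    Hypothetical helper for illustration. With 0 or 1 visible GPUs the
    model is returned without the DataParallel wrapper.
    """
    n_gpus = torch.cuda.device_count()
    if n_gpus > 0:
        model = model.cuda()
    if n_gpus > 1:
        # DataParallel splits each input batch along dim 0, runs one
        # replica per GPU, and gathers the outputs on the first GPU.
        model = nn.DataParallel(model)
    return model


# Stand-in model; a real training script would pass its own network.
model = wrap_for_all_gpus(nn.Linear(4, 2))
print("visible GPUs:", torch.cuda.device_count())
```

If `torch.cuda.device_count()` prints 8 inside the container but only one GPU shows load, the container is probably fine and the training script is simply single-GPU. For longer runs, `DistributedDataParallel` is generally the recommended option, at the cost of a slightly more involved launch.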

The limited GPU usage has significantly slowed down my training process, and I’m eager to resolve this issue. As a side note, my system originally ran DGX OS 3.1.2. To enable the use of CUDA 12.1, I performed a fresh installation of DGX OS 5.4 and subsequently upgraded to DGX OS 6.1. It’s possible that this upgrade has contributed to the issue I’m facing.
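One way to check whether all GPUs are still visible after the OS upgrade, and which of them are actually idle, is to query `nvidia-smi` from inside the container. The following is a rough sketch under a few assumptions: `nvidia-smi` is on the PATH inside the container, the helper names (`parse_gpu_csv`, `query_gpus`, `idle_gpus`) are made up for this example, and the 5% idle threshold is arbitrary.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "index,name,utilization.gpu,memory.used"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != 4:
            continue  # skip blank or malformed lines
        index, name, util, mem = (field.strip() for field in rec)
        try:
            rows.append({
                "index": int(index),
                "name": name,
                "util_pct": int(util),
                "mem_mib": int(mem),
            })
        except ValueError:
            continue  # skip rows with non-numeric values such as "[N/A]"
    return rows


def query_gpus():
    """Run nvidia-smi and return one dict per visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)


def idle_gpus(rows, util_threshold=5):
    """Return indices of GPUs at or below the utilization threshold (assumed 5%)."""
    return [r["index"] for r in rows if r["util_pct"] <= util_threshold]


# Usage, while the training job is running:
#   rows = query_gpus()
#   print(len(rows), "GPUs visible; idle:", idle_gpus(rows))
```

If fewer than eight GPUs are listed, the problem is likely in the driver or container setup. If all eight are listed and seven are idle, the training script itself is the more likely cause.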

In addition, I tested the “PyTorch | NVIDIA NGC” container image by running the MNIST example (main.py). To my surprise, a single iteration took more than 10 seconds, and one epoch took several minutes. This is notably slower than an RTX 3070 and lags significantly behind the DGX-2 T4 platform.

I’m keen to determine if there’s a misconfiguration on my system or if there are other underlying issues at play. Your insights and guidance on this matter would be greatly appreciated. Thank you in advance for your assistance.
