Hi, I'm training MNIST on a DGX with A100-SXM4-40GB GPUs with MIG (Multi-Instance GPU) enabled, using Docker, but it only detects the first GPU.
Docker version: 20.10.7
I am using this command:
docker run --gpus '"device=2:0,3:1,5:1"' nvcr.io/nvidia/pytorch:20.11-py3 /bin/bash -c 'cd /opt/pytorch/examples/upstream/mnist && python main.py'
(Screenshot of `nvidia-smi` attached: only the first GPU is being used.)
When training another model in TAO I get a similar error, which I think is related to the same issue:
tlt_1 | tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '2' but visible device count is 1
Thanks in advance.
As a sanity check, does this also happen when you address the devices by their MIG- UUIDs, as described in the NVIDIA Multi-Instance GPU User Guide :: NVIDIA Tesla Documentation?
I tried those names too:
docker run --rm -it --gpus '"device=MIG-6b1e9520-11da-5757-b9df-8ac6a3d9a78f,MIG-10d8cd6b-6356-557a-9ed4-fe10edf9e394,MIG-ecf7a5d7-fd31-5554-a5e4-42717ccdf584"' nvcr.io/nvidia/pytorch:20.11-py3 /bin/bash
but same result: it only runs on the first GPU.
Ah, I get what you’re saying now @leo2105 .
When using MIG, a single application can only actually execute on a single “GPU” (aka MIG device). You could run two different PyTorch instances inside the container, each using a different slice, but a single instance can only access one slice.
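To illustrate that pattern, here is a minimal, hypothetical sketch of launching one independent training process per MIG slice by exposing exactly one slice to each process via `CUDA_VISIBLE_DEVICES`. The UUIDs are the ones from the `docker run` command above, and the script path is the MNIST example inside the NGC container; note this runs separate copies of the training, not data-parallel training across slices:

```python
import os
import subprocess
import sys

# MIG slice UUIDs as reported by `nvidia-smi -L` (from the command above).
MIG_DEVICES = [
    "MIG-6b1e9520-11da-5757-b9df-8ac6a3d9a78f",
    "MIG-10d8cd6b-6356-557a-9ed4-fe10edf9e394",
    "MIG-ecf7a5d7-fd31-5554-a5e4-42717ccdf584",
]

def launch(cmd, mig_uuid):
    """Start one process pinned to a single MIG slice.

    A single CUDA process can only use one MIG device, so we expose
    exactly one slice per process through CUDA_VISIBLE_DEVICES.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=mig_uuid)
    return subprocess.Popen(cmd, env=env)

if __name__ == "__main__":
    script = "/opt/pytorch/examples/upstream/mnist/main.py"
    if os.path.exists(script):  # only present inside the NGC container
        procs = [launch([sys.executable, script], uuid) for uuid in MIG_DEVICES]
        for p in procs:
            p.wait()
```

Each process sees its slice as `cuda:0`, so the training script itself needs no changes.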
So you're saying I can't train TAO models on a DGX with MIG at scale, i.e. using more than one GPU?
Correct. You can disable MIG on the GPUs and do multi-GPU training on the DGX though.
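For reference, a rough sketch of disabling MIG (flags per the NVIDIA MIG User Guide; this needs root, and no processes may be running on the GPUs — treat it as an outline rather than a tested recipe):

```shell
# Disable MIG mode on all GPUs (or target one with -i <index>):
sudo nvidia-smi -mig 0

# A GPU reset is required for the change to take effect
# (repeat per GPU index, or simply reboot the DGX):
sudo nvidia-smi --gpu-reset -i 0

# Afterwards the full GPUs are visible to a single container:
docker run --rm --gpus all nvcr.io/nvidia/pytorch:20.11-py3 nvidia-smi -L
```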
Hi @ScottEllis, I have one more doubt related to what you said before,
“When using MIG, a single application can only actually execute on a single ‘GPU’ (aka MIG device). You could run two different PyTorch instances inside the container, each using a different slice, but a single instance can only access one slice.”
Does it apply to all applications? For example TensorFlow, Triton, DeepStream, TAO, etc.
Thanks in advance