Docker doesn't detect MIG GPU devices

Hi, I am training MNIST on a DGX with A100-SXM4-40GB GPUs with MIG (Multi-Instance GPU) enabled, using Docker, but it only detects the first GPU.

Docker version: 20.10.7

I am using this command:

docker run --gpus '"device=2:0,3:1,5:1"' nvcr.io/nvidia/pytorch:20.11-py3 /bin/bash -c 'cd /opt/pytorch/examples/upstream/mnist && python main.py'

An image of nvidia-smi, where only the first GPU is being used.

When I train another model in TAO I get a similar problem, and I think it's related to the one above.

tlt_1 | tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '2' but visible device count is 1

Thanks in advance.

As a sanity check, does this also happen when you use the MIG- device names, as described in the NVIDIA Multi-Instance GPU User Guide?

ScottE

I used those names too:

docker run --rm -it --gpus '"device=MIG-6b1e9520-11da-5757-b9df-8ac6a3d9a78f,MIG-10d8cd6b-6356-557a-9ed4-fe10edf9e394,MIG-ecf7a5d7-fd31-5554-a5e4-42717ccdf584"' nvcr.io/nvidia/pytorch:20.11-py3 /bin/bash

but I get the same result: it's only running on the first GPU.
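For reference, the MIG- device names used above can be read from `nvidia-smi -L`. Below is a minimal Python sketch of collecting them, assuming each MIG device line in that output contains a `(UUID: MIG-...)` field; the exact layout can differ between driver versions, and the helper names `parse_mig_uuids` and `list_mig_uuids` are hypothetical.

```python
import re
import subprocess

# Assumption: MIG devices appear in `nvidia-smi -L` output as "(UUID: MIG-...)".
MIG_UUID_PATTERN = re.compile(r"\(UUID:\s*(MIG-[^)\s]+)\)")


def parse_mig_uuids(listing):
    """Return every MIG device UUID found in the text of `nvidia-smi -L`."""
    return MIG_UUID_PATTERN.findall(listing)


def list_mig_uuids():
    """Run `nvidia-smi -L` on this machine and return the MIG UUIDs it reports."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return parse_mig_uuids(result.stdout)
```

Calling `list_mig_uuids()` on the host (or inside the container) gives the names that can be passed to `--gpus '"device=..."'`.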

Ah, I get what you’re saying now, @leo2105.

When using MIG, a single application can only actually execute on a single “GPU” (aka MIG device). You could run two different PyTorch instances inside the container, each using a different slice, but a single instance can only access one slice.
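To illustrate the "one instance per slice" pattern, here is a minimal Python sketch that starts one independent process per MIG device by setting `CUDA_VISIBLE_DEVICES` to a single MIG UUID for each process. The helper names `build_env` and `launch_per_slice` are hypothetical, and the processes are separate jobs: this is not multi-GPU training of one model.

```python
import os
import subprocess


def build_env(mig_uuid, base_env=None):
    """Copy the environment and expose exactly one MIG device to the child."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = mig_uuid
    return env


def launch_per_slice(mig_uuids, command):
    """Start `command` once per MIG device and return the exit codes in order."""
    procs = [subprocess.Popen(command, env=build_env(uuid)) for uuid in mig_uuids]
    return [proc.wait() for proc in procs]
```

For example, inside a container started with the three MIG devices from this thread, `launch_per_slice(["MIG-6b1e9520-11da-5757-b9df-8ac6a3d9a78f", "MIG-10d8cd6b-6356-557a-9ed4-fe10edf9e394", "MIG-ecf7a5d7-fd31-5554-a5e4-42717ccdf584"], ["python", "main.py"])` would run three separate MNIST trainings, one on each slice.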

So you are saying that I can’t train TAO models at scale on a DGX with MIG, that is, using more than one GPU?

Correct. You can disable MIG on the GPUs and do multi-GPU training on the DGX though.

Hi @ScottEllis, I have one more question related to what you said before:

“When using MIG, a single application can only actually execute on a single ‘GPU’ (aka MIG device). You could run two different PyTorch instances inside the container, each using a different slice, but a single instance can only access one slice.”

Does this apply to all applications, for example TensorFlow, Triton, DeepStream, TAO, etc.?

Thanks in advance
