Training with TAO 3.0 on a DGX inside docker-compose with multi-GPU MIG

Hi, I am trying to train detectnet_v2 with TLT 3.0 on a DGX with A100-SXM4-40GB GPUs using MIG (Multi-Instance GPU), inside docker-compose.

I am using this service in my docker-compose.yml:

  tlt:
    build:
      context: /home/lvera/tlt-airflow-mlflow/tlt
      dockerfile: Dockerfile_tlt
    environment:
      NVIDIA_VISIBLE_DEVICES: 2:0,3:1,5:1
    ipc: host
    runtime: nvidia
    shm_size: 1g
    stdin_open: true
    tty: true
    ulimits:
      memlock: -1
      stack: 67108864

Inside this container, all the GPUs are recognized.
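For example, this is how I check what the container can see (assuming the TLT image ships nvidia-smi and the service is named tlt as in the compose file above):

  # List the GPU / MIG instances visible inside the running service container
  docker-compose exec tlt nvidia-smi -L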

This log shows the command that I use for training the model and the error.

log_train.txt (14.5 KB)

When I run tlt train ... without --gpus 3 it only runs on one GPU and there are no problems, but when I pass --gpus 3 I get that error.
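Roughly, the command looks like this (the spec path, results directory, and key below are placeholders, not my real values):

  # TLT 3.0 launcher syntax for multi-GPU detectnet_v2 training
  tlt detectnet_v2 train \
    -e /workspace/specs/detectnet_v2_train.txt \
    -r /workspace/results \
    -k <encryption_key> \
    --gpus 3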

In conclusion, what should I write in order to train on all the GPUs inside the container?

Thanks in advance.

For MIG, may I know how you set it up for TAO training?
To narrow it down, please try to run without TAO. See https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#running-containers. Could you try running a training on the MNIST dataset using the GPUs?
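For example, a quick check along the lines of the MIG guide (the CUDA image tag below is only an example) shows which instances a container actually enumerates when given several MIG devices:

  # List what the container sees when several MIG instances are requested
  docker run --rm --runtime=nvidia --gpus '"device=2:0,3:1,5:1"' nvidia/cuda:11.0-base nvidia-smi -L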

When I run this command

  docker run --gpus '"device=2:0,3:1,5:1"' nvcr.io/nvidia/pytorch:20.11-py3 /bin/bash -c 'cd /opt/pytorch/examples/upstream/mnist && python main.py'

it only uses the first GPU.
Maybe the problem is in how the GPUs are being passed?

According to log_train.txt, the error is here:
tlt_1 | tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '2' but visible device count is 1
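To confirm, you can also print what TensorFlow itself enumerates inside the container (assuming the tlt service name from the compose file above and that TensorFlow is importable in that environment); with MIG this may report only a single GPU:

  # Print the devices TensorFlow can see from inside the container
  docker-compose exec tlt python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"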

Could you ask the question in the DGX user forum?


It's here:

I hope to get a fast reply. Thanks, Morganh.

