Hi, I am trying to train detectnet_v2 in TLT 3.0 in a DGX of A100-SXM4-40GB with MIG (multi instance GPUs) inside docker-compose.
I am using this service for docker-compose.yml
tlt:
  build:
    context: /home/lvera/tlt-airflow-mlflow/tlt
    dockerfile: Dockerfile_tlt
  environment:
    NVIDIA_VISIBLE_DEVICES: 2:0,3:1,5:1
  ipc: host
  runtime: nvidia
  shm_size: 1g
  stdin_open: true
  tty: true
  ulimits:
    memlock: -1
    stack: 67108864
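As a point of comparison, MIG devices can also be requested by UUID rather than the GPU:instance index form. Here is a hedged sketch of the same service using UUIDs — the UUIDs below are placeholders, not real values; list the actual ones on the host with nvidia-smi -L:

```yaml
tlt:
  build:
    context: /home/lvera/tlt-airflow-mlflow/tlt
    dockerfile: Dockerfile_tlt
  runtime: nvidia
  environment:
    # Placeholder UUIDs -- substitute the output of `nvidia-smi -L`
    NVIDIA_VISIBLE_DEVICES: "MIG-GPU-xxxxxxxx/2/0,MIG-GPU-yyyyyyyy/3/1,MIG-GPU-zzzzzzzz/5/1"
  ipc: host
  shm_size: 1g
```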
So, inside this container, all the GPUs are recognized.
This log shows the command that I use for training the model and the error.
log_train.txt (14.5 KB)
So when I run tlt train ... without --gpus 3
it runs on only one GPU and there are no problems, but when I pass --gpus 3
I get that error.
In short: what should I write so that training runs on all the GPUs inside the container?
Thanks in advance.
For MIG, may I know how you set it up for TAO training?
To narrow down, please try to run without TAO. See https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#running-containers — could you try a training run on the MNIST dataset using the GPUs?
When I run this command:
docker run --gpus '"device=2:0,3:1,5:1"' nvcr.io/nvidia/pytorch:20.11-py3 /bin/bash -c 'cd /opt/pytorch/examples/upstream/mnist && python main.py'
it only uses the first GPU.
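Two things seem worth checking here. First, the curly quotes around the device list will break shell parsing, so the list may not be reaching Docker at all; use plain ASCII quotes. Second, if I read the MIG user guide correctly, CUDA applications can only enumerate a single MIG device per process, so even with several MIG slices exposed, one python main.py process will still run on just one of them. A hedged sketch of the command with ASCII quoting (the UUIDs are placeholders):

```shell
# List GPUs and MIG devices; copy the real UUIDs from this output
nvidia-smi -L

# Pass the MIG devices with plain ASCII quotes (placeholder UUIDs shown)
docker run --gpus '"device=MIG-GPU-xxxx/2/0,MIG-GPU-yyyy/3/1,MIG-GPU-zzzz/5/1"' \
  nvcr.io/nvidia/pytorch:20.11-py3 \
  /bin/bash -c 'cd /opt/pytorch/examples/upstream/mnist && python main.py'
```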
Maybe the problem is in how the GPUs are being passed?
According to log_train.txt, the error is:
tlt_1 | tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '2' but visible device count is 1
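That message corresponds to a simple range check: every index in TensorFlow's visible_device_list must refer to a device the process can actually see, and inside a MIG slice the container typically sees its devices renumbered from 0. Here is a minimal pure-Python sketch of that check (an illustration, not TensorFlow's actual code):

```python
def validate_visible_device_list(visible_device_list, visible_device_count):
    """Mimic the check TensorFlow performs: each id in
    'visible_device_list' (a comma-separated string) must index one of
    the `visible_device_count` devices the process can see."""
    ids = [int(i) for i in visible_device_list.split(",")]
    for gpu_id in ids:
        if not 0 <= gpu_id < visible_device_count:
            raise ValueError(
                f"'visible_device_list' listed an invalid GPU id '{gpu_id}' "
                f"but visible device count is {visible_device_count}")
    return ids

# With three devices visible, ids 0-2 are fine:
validate_visible_device_list("0,1,2", 3)
# Inside the container only one device is visible, so id '2' raises
# the same ValueError as in the log:
# validate_visible_device_list("2", 1)
```

So the error suggests that, at training time, the process only sees one GPU even though --gpus 3 told TAO to address indices up to 2.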
Could you ask the question in the DGX user forum?
It's here:
Hi, I am training MNIST on a DGX A100-SXM4-40GB with MIG (Multi-Instance GPU) using Docker, but it only detects the first GPU.
docker version : 20.10.7
I am using this command:
docker run --gpus '"device=2:0,3:1,5:1"' nvcr.io/nvidia/pytorch:20.11-py3 /bin/bash -c 'cd /opt/pytorch/examples/upstream/mnist && python main.py'
An attached image of nvidia-smi shows that just the first GPU is being used.
When I am training another model in TAO I got a similar p…
I hope to get a reply soon. Thanks, Morganh.
system closed this topic on March 1, 2022, 12:33am.
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.