March 27, 2023, 3:19am
I’m trying to run the example code in yolo_v4.ipynb jupyter notebook using TAO.
For this I use a Supermicro Server 4028gr-tvrt with 8x Tesla V100. When I train with one GPU, everything works. However, when I want to train with all 8 GPUs, I get an error.
print(“To run with multigpu, please change --gpus based on the number of available GPUs in your machine.”)
!tao yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet18_kitti.txt
Here is the error message and the nvidia-smi output:
log.txt (169.0 KB)
nvida-smi.txt (3.6 KB)
Could you pull the 4.0.1 version of TAO docker and retry?
$ docker pull
Then, log in to the docker:
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash
Then, run the commands. Note that the “tao” prefix is not needed when running inside the docker.
yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet18_kitti.txt
March 27, 2023, 6:37am
Hi, I just tried it. With --gpus 1 it works, but with --gpus > 1 I get the same error.
Can you try an older version of the nvidia driver?
sudo apt purge nvidia-driver-525
sudo apt autoremove
sudo apt autoclean
sudo apt install nvidia-driver-515
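After installing and rebooting, it is worth confirming which driver is actually active. For example (standard nvidia-smi query flags):

```shell
# Print just the driver version reported for each GPU
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```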
March 28, 2023, 7:49pm
I’ve tried everything and have no idea what the problem could be. I followed all the steps from the post
More than 1 GPU not working using Tao Train - #22 by user82614
I installed CUDA following
cuda-installation-guide-linux 12.1 documentation
I installed NCCL from GitHub - NVIDIA/nccl: Optimized primitives for collective multi-GPU communication
The NCCL test runs without errors:
nccl-test.txt (3.8 KB)
I tried the nvidia-driver-525, nvidia-driver-515 and nvidia-driver-510.
I even tried Ubuntu 20.04 and 18.04.
Appreciate your work. Would you please help by running training with an older version of the TAO container? In the 21.11 version, we did not see multi-GPU training errors.
For YOLOv4, please run
$ docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
Then log in to the docker and run the command without the “tao” prefix. That means,
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
After logging in to the docker, run
yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet18_kitti.txt
The docker is from
TAO Toolkit for Computer Vision | NVIDIA NGC
March 29, 2023, 4:47pm
I’m pretty confused right now. I proceeded step by step as follows:
I followed the guide here using the “Launcher CLI”
TAO Toolkit Quick Start Guide - NVIDIA Docs
After that I start jupyter notebook --no-browser --port=8080 --allow-root
Since I’m accessing remotely, I open a new terminal with the command: ssh -L 8080:localhost:8080
I run all the steps as described in: yolo_v4.ipynb
I get the error when trying to train with more than 1 GPU.
When I follow your approach and try to start Jupyter inside the docker, I get the error “Connection refused”. So I mounted my previous project in the docker and ran your training command. This works with 1 GPU but not with 8:
Error when training with multiple GPUs in TAO - #2 by Morganh
I get completely different error messages and it doesn’t work at all:
Error when training with multiple GPUs in TAO - #6 by Morganh
I’m sorry but can you please explain the steps in a bit more detail?
OK, in short, the steps I mentioned above use an older version of the TAO docker.
Currently, you are using 4.0.0 docker or 4.0.1 docker.
Can you run “docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3” on your remote machine? If yes, you can run training as mentioned above.
March 29, 2023, 5:23pm
Yes I ran docker pull.
After that I ran docker run but I get the error:
chmod: cannot access ‘/opt/ngccli/ngc’: No such file or directory
So I changed the command to include my project: docker run --runtime=nvidia -it --rm --entrypoint "" -v /home/user/getting_started_v4.0.1/:/data
Is that correct so far?
Then I ran the training command and got the error shown in the output:
yolo_v4 train -e /data/notebooks/tao_launcher_starter_kit/yolo_v4/specs/yolo_v4_train_resnet18_kitti.txt -r /workspace/yolo_v4/experiment_dir_unpruned10 -k myKEY --gpus 8
output.txt (34.6 KB)
Correct. From the log, could you please delete the “visualizer” section in yolo_v4_train_resnet18_kitti.txt? The previous version does not have this feature.
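For reference, the section to remove looks roughly like the following in 4.0.x spec files (the exact fields may differ in your file, so check your own spec):

```
visualizer {
  enabled: true
  num_images: 3
}
```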
March 31, 2023, 4:34pm
Many thanks :) It works now. All 8 graphics cards are used.
Can you explain to me what caused the problem?
Can I also run it without “docker run --runtime=nvidia”, as described here?
Error when training with multiple GPUs in TAO - #7 by deveso
Not sure; it needs further checking. So, could you help confirm the conclusion:
nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 ==> works
nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 ==> does not work
Besides the docker, is there any difference in the environment, for example in nvidia-smi, etc.?
Actually, the approach you mention uses the TAO launcher. I think you have already updated the TAO launcher to the latest version, so it will be calling the latest 4.0.1 docker.
It is similar to “docker run”. By using “docker run”, you can also launch the notebooks.
March 31, 2023, 5:53pm
Unfortunately I couldn’t find any differences. Using “docker run” is not a problem. I managed to get the notebooks to run remotely in docker with port forwarding.
However, the GPU utilization worries me a bit. When I train with the standard configuration, the GPUs are only lightly utilized. What could be the reason?
I would like to use TensorBoard, but that’s no longer possible because this version doesn’t support it, right? Is that why I had to delete the visualizer section?
The lower GPU utilization is another topic, similar to (topic). We are tracking it.
Yes, the 21.11 version does not have this feature yet, but 22.05 supports TensorBoard. Can you use the 22.05 version?
April 3, 2023, 12:23pm
Version 22.05 also works with 8 GPUs. Many thanks for the help!
I will follow up in that topic because of the poor performance.
There is no update from you for a period, assuming this is not an issue anymore. Hence we are closing this topic. If need further support, please reopen this topic or open a new one. Thanks
For using the 4.0.1 docker with multiple GPUs, if you have time, please install the older NCCL 2.11.4 version instead and try the run again.
Log in to the 4.0.1 docker and install it:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt install libnccl2=2.11.4-1+cuda11.6 libnccl-dev=2.11.4-1+cuda11.6
ldconfig -v | grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//'
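That last command just strips everything up to and including “.so.” from the library’s soname, leaving the version number. The sed step can be sanity-checked on its own with a sample soname (illustrative only, no NCCL required):

```shell
# The sed expression removes everything up to and including ".so.",
# leaving just the version suffix
sample="libnccl.so.2.11.4"
ver=$(echo "$sample" | sed -r 's/^.*\.so\.//')
echo "$ver"   # prints 2.11.4
```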
If successful, you can run “docker commit” to generate a new docker image for future use.
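For example (the container ID and image tag below are placeholders; take the real ID from docker ps):

```shell
# On the host: find the running container's ID, then snapshot it
# as a new image so the NCCL downgrade persists
docker ps
docker commit <container-id> tao-toolkit:4.0.1-nccl2.11.4
```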
May 4, 2023, 5:56am
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.