Error when training with multiple GPUs in TAO

I’m trying to run the example code from the yolo_v4.ipynb Jupyter notebook using TAO.
For this I’m using a Supermicro 4028GR-TVRT server with 8x Tesla V100. When I train with one graphics card, everything works. However, when I try to train with all 8 graphics cards, I get an error.

print("To run with multigpu, please change --gpus based on the number of available GPUs in your machine.")
!tao yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet18_kitti.txt \
                   -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                   -k $KEY \
                   --gpus 8

Here is the error message and the nvidia-smi output:
log.txt (169.0 KB)

nvida-smi.txt (3.6 KB)

Hi,
Could you pull the 4.0.1 version of TAO docker and retry?

$ docker pull nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5

Then, log in to the docker:
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

Then, run the commands. Note: "tao" is not needed when running inside the docker.

yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet18_kitti.txt \
              -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
              -k $KEY \
              --gpus 8
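
Note that $SPECS_DIR and $USER_EXPERIMENT_DIR must point to paths that exist inside the container. A minimal sketch of starting the container with the host directories mounted (the host paths below are placeholders, not taken from this thread):

$ docker run --runtime=nvidia -it --rm \
    -v /path/to/specs:/workspace/specs \
    -v /path/to/experiments:/workspace/experiments \
    nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash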

Hi, I just tried it. With --gpus 1 it works, but with --gpus > 1 I get the same error.

Can you try an older version of the nvidia driver?

Uninstall:
sudo apt purge nvidia-driver-525
sudo apt autoremove
sudo apt autoclean

Install:
sudo apt install nvidia-driver-515
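
After rebooting, one way to confirm which driver version is active (a standard nvidia-smi query, shown here only as a check):

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader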

I’ve tried everything and have no idea what the problem could be. I followed all the steps from the post More than 1 GPU not working using Tao Train - #22 by user82614

I installed CUDA following the cuda-installation-guide-linux 12.1 documentation.
I installed NCCL from GitHub - NVIDIA/nccl: Optimized primitives for collective multi-GPU communication.

The NCCL test runs without errors:
nccl-test.txt (3.8 KB)
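
For reference, a typical invocation of the bandwidth test from the separate NVIDIA/nccl-tests repository looks like this (illustrative parameters, assuming the tests were built with make):

$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 8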

I tried the nvidia-driver-525, nvidia-driver-515 and nvidia-driver-510.
I even tried Ubuntu 20.04 and 18.04.

Appreciate your work. Would you please help by running training with an older version of the TAO container? With the 21.11 version, we did not see multi-GPU training errors.
For yolov4, please

$ docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3

Then log in to the docker and run the command without "tao". That means,

$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3

After logging into the docker, run

yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet18_kitti.txt \
              -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
              -k $KEY \
              --gpus 8

The docker is from TAO Toolkit for Computer Vision | NVIDIA NGC

I’m pretty confused right now. I proceeded step by step as follows:

  1. I followed the guide here using the "Launcher CLI": TAO Toolkit Quick Start Guide - NVIDIA Docs
  2. After that I start jupyter notebook --no-browser --port=8080 --allow-root
  3. Since I’m accessing remotely, I open a new terminal with the command: ssh -L 8080:localhost:8080 user@192.168.188.25
  4. I run all the steps as described in: yolo_v4.ipynb
  5. I get the error when trying to train with more than 1 GPU.

When I follow your approach and try to start Jupyter in the docker, I get the error "Connection refused". So I mounted my previous project in the docker and ran your training command. This works with 1 GPU but not with 8: Error when training with multiple GPUs in TAO - #2 by Morganh

I get completely different error messages and it doesn’t work at all: Error when training with multiple GPUs in TAO - #6 by Morganh

I’m sorry but can you please explain the steps in a bit more detail?

OK, in short, the steps I mentioned above use an older version of the TAO docker:
nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3

Currently, you are using 4.0.0 docker or 4.0.1 docker.

Can you run "docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 " in your remote machine? If yes, you can run training as mentioned above.

Yes, I ran docker pull.
After that I ran docker run, but I get the error:
chmod: cannot access '/opt/ngccli/ngc': No such file or directory

So I changed the command to include my project: docker run --runtime=nvidia -it --rm --entrypoint "" -v /home/user/getting_started_v4.0.1/:/data nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 /bin/bash

Is that correct so far?

Then I ran the command and got the error shown in the attached output:
yolo_v4 train -e /data/notebooks/tao_launcher_starter_kit/yolo_v4/specs/yolo_v4_train_resnet18_kitti.txt -r /workspace/yolo_v4/experiment_dir_unpruned10 -k myKEY --gpus 8
output.txt (34.6 KB)

Correct. From the log, could you please delete the "visualizer" section in yolo_v4_train_resnet18_kitti.txt? The previous version does not have this feature.
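
For reference, the section to remove looks roughly like this (field names are assumed from the 4.0.x spec format; your exact values may differ):

visualizer {
  enabled: true
  num_images: 3
}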

Many thanks :) It works now. All 8 graphics cards are used.
Can you explain to me what caused the problem?

Can I also run it without "docker run --runtime=nvidia", as described in Error when training with multiple GPUs in TAO - #7 by deveso?

Not sure. Need to check further. So, could you help confirm:
nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 ==> works
nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 ==> does not work

Besides the docker, is there any difference in the environment, for example in the nvidia-smi output, etc.?

Actually, the way you mention uses the tao launcher. I think you have already updated the tao launcher to the latest version, so the tao launcher will be calling the latest 4.0.1 docker.
It is similar to "docker run". By using "docker run", you can also launch the notebooks.

Unfortunately I couldn’t find any differences. Using “docker run” is not a problem. I managed to get the notebooks to run remotely in docker with port forwarding.
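
For anyone following along, a sketch of that setup (ports and paths are placeholders):

On the server, start the container with the project mounted and the notebook port published:
$ docker run --runtime=nvidia -it --rm --entrypoint "" -p 8888:8888 \
    -v /home/user/getting_started_v4.0.1:/workspace/getting_started \
    nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 /bin/bash

Then, inside the container:
jupyter notebook --no-browser --ip=0.0.0.0 --port=8888 --allow-root

On the local machine, forward the port:
$ ssh -L 8888:localhost:8888 user@192.168.188.25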

However, the GPU utilization worries me a bit. If I train with the standard configuration, the graphics cards are only slightly utilized. What can be the reason?

Screenshot 2023-03-31 193423

I would like to use TensorBoard. But that’s no longer possible because the version doesn’t support it, right? That’s why I had to delete the visualizer section, right?

The lower GPU utilization is another topic, similar to (topic). We are tracking it.

Yes. The 21.11 version does not have this feature yet, but 22.05 does support TensorBoard. Can you use nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 instead?
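
A rough sketch of how that would be used (assuming the 22.05 spec accepts the visualizer section again, and that the event files land under the results directory):

In yolo_v4_train_resnet18_kitti.txt, re-enable the visualizer:
visualizer {
  enabled: true
}

Then point TensorBoard at the results directory:
$ tensorboard --logdir $USER_EXPERIMENT_DIR/experiment_dir_unpruned --port 6006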

Version 22.05 also works with 8 GPUs. Many thanks for the help!

I will follow up in that topic regarding the poor performance.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please reopen this topic or open a new one. Thanks.

To use the 4.0.1 docker with multiple GPUs, if you have time, please install the older NCCL 2.11.4 version instead and try to run.
Log in to the 4.0.1 docker and install it:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt install libnccl2=2.11.4-1+cuda11.6 libnccl-dev=2.11.4-1+cuda11.6
ldconfig -v | grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//'

If it works successfully, you can run "docker commit" to generate a new docker image for future use.
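
For example (the container ID and tag below are placeholders):

$ docker commit <container_id> tao-toolkit:4.0.1-tf1.15.5-nccl2.11.4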
