March 27, 2023, 3:19am
I’m trying to run the example code in yolo_v4.ipynb jupyter notebook using TAO.
For this I use a Supermicro Server 4028gr-tvrt with 8x Tesla V100. When I train with one GPU, everything works. However, when I want to train with all 8 GPUs, I get an error.
print(“To run with multigpu, please change --gpus based on the number of available GPUs in your machine.”)
!tao yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet18_kitti.txt
Here is the error message and the nvidia-smi output:
log.txt (169.0 KB)
nvida-smi.txt (3.6 KB)
Could you pull the 4.0.1 version of TAO docker and retry?
$ docker pull
Then, log in to the docker:
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash
Then, run the commands. Note that the “tao” prefix is not needed when running inside the docker.
yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet18_kitti.txt
March 27, 2023, 6:37am
Hi, I just tried it. With --gpus 1 it works, but with --gpus > 1 I get the same error.
Can you try an older version of the nvidia driver?
sudo apt purge nvidia-driver-525
sudo apt autoremove
sudo apt autoclean
sudo apt install nvidia-driver-515
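After installing and rebooting, it is worth confirming which driver is actually active. For example (standard nvidia-smi query flags):

```shell
# Print just the driver version reported for each GPU
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```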
March 28, 2023, 7:49pm
I’ve tried everything and have no idea what the problem could be. I followed all the steps from the post
More than 1 GPU not working using Tao Train - #22 by user82614
I installed CUDA following
cuda-installation-guide-linux 12.1 documentation
I installed NCCL from GitHub - NVIDIA/nccl: Optimized primitives for collective multi-GPU communication
The NCCL test runs without errors:
nccl-test.txt (3.8 KB)
I tried the nvidia-driver-525, nvidia-driver-515 and nvidia-driver-510.
I even tried Ubuntu 20.04 and 18.04.
Appreciate your work. Would you please help by running training with an older version of the TAO container? In the 21.11 version, we did not see multi-GPU training errors.
For YOLOv4, please run
$ docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
Then log in to the docker and run the command without the “tao” prefix. That means,
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
After logging in to the docker, run
yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet18_kitti.txt
The docker is from
TAO Toolkit for Computer Vision | NVIDIA NGC
March 29, 2023, 4:47pm
I’m pretty confused right now. I proceeded step by step as follows:
I followed the guide here using the “Launcher CLI”
TAO Toolkit Quick Start Guide - NVIDIA Docs
After that I start jupyter notebook --no-browser --port=8080 --allow-root
Since I’m accessing remotely, I open a new terminal with the command: ssh -L 8080:localhost:8080
I run all the steps as described in: yolo_v4.ipynb
I get the error when trying to train with more than 1 GPU.
When I follow your approach and try to start Jupyter inside the docker, I get the error “Connection refused”. So I mounted my previous project in the docker and ran your training command. This works with 1 GPU but not with 8:
Error when training with multiple GPUs in TAO - #2 by Morganh
I get completely different error messages and it doesn’t work at all:
Error when training with multiple GPUs in TAO - #6 by Morganh
I’m sorry but can you please explain the steps in a bit more detail?
OK, in short, the steps I mentioned above use an older version of the TAO docker.
Currently, you are using 4.0.0 docker or 4.0.1 docker.
Can you run “docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3” on your remote machine? If yes, you can run training as mentioned above.
March 29, 2023, 5:23pm
Yes I ran docker pull.
After that I ran docker run but I get the error:
chmod: cannot access ‘/opt/ngccli/ngc’: No such file or directory
So I changed the command to include my project: docker run --runtime=nvidia -it --rm --entrypoint "" -v /home/user/getting_started_v4.0.1/:/data
Is that correct so far?
Then I ran the training command and got the error shown in the output:
yolo_v4 train -e /data/notebooks/tao_launcher_starter_kit/yolo_v4/specs/yolo_v4_train_resnet18_kitti.txt -r /workspace/yolo_v4/experiment_dir_unpruned10 -k myKEY --gpus 8
output.txt (34.6 KB)
Correct. From the log, could you please delete the “visualizer” section in yolo_v4_train_resnet18_kitti.txt? The previous version does not have this feature.
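For reference, the section to remove looks roughly like the following in 4.0.x spec files (the exact fields may differ in your file, so check your own spec):

```
visualizer {
  enabled: true
  num_images: 3
}
```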
March 31, 2023, 4:34pm
Many thanks :) It works now. All 8 graphics cards are used.
Can you explain to me what caused the problem?
Can I also run it without “docker run --runtime=nvidia”, as described here?
Error when training with multiple GPUs in TAO - #7 by deveso
Not sure; it needs further checking. So, could you help confirm the conclusion:
nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 ==> works
nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 ==> does not work
Besides the docker, is there any difference in the environment, for example in nvidia-smi, etc.?
Actually, the approach you mention uses the TAO launcher. I think you have already updated the TAO launcher to the latest version, so it will be calling the latest 4.0.1 docker.
It is similar to “docker run”. By using “docker run”, you can also launch the notebooks.
March 31, 2023, 5:53pm
Unfortunately I couldn’t find any differences. Using “docker run” is not a problem. I managed to get the notebooks to run remotely in docker with port forwarding.
However, the GPU utilization worries me a bit. When I train with the standard configuration, the GPUs are only lightly utilized. What could be the reason?
I would like to use TensorBoard, but that’s no longer possible because this version doesn’t support it, right? Is that why I had to delete the visualizer section?
The lower GPU utilization is another topic, similar to (topic). We are tracking it.
Yes, the 21.11 version does not have this feature yet, but 22.05 supports TensorBoard. Can you use the 22.05 version?
April 3, 2023, 12:23pm
Version 22.05 also works with 8 GPUs. Many thanks for the help!
I will follow up in that topic because of the poor performance.
There is no update from you for a period, assuming this is not an issue anymore. Hence we are closing this topic. If need further support, please reopen this topic or open a new one. Thanks
For using the 4.0.1 docker with multiple GPUs, if you have time, please install the older NCCL 2.11.4 version instead and try the run again.
Log in to the 4.0.1 docker and install it:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt install libnccl2=2.11.4-1+cuda11.6 libnccl-dev=2.11.4-1+cuda11.6
ldconfig -v | grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//'
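That last command just strips everything up to and including “.so.” from the library’s soname, leaving the version number. The sed step can be sanity-checked on its own with a sample soname (illustrative only, no NCCL required):

```shell
# The sed expression removes everything up to and including ".so.",
# leaving just the version suffix
sample="libnccl.so.2.11.4"
ver=$(echo "$sample" | sed -r 's/^.*\.so\.//')
echo "$ver"   # prints 2.11.4
```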
If successful, you can run “docker commit” to generate a new docker image for future use.
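For example (the container ID and image tag below are placeholders; take the real ID from docker ps):

```shell
# On the host: find the running container's ID, then snapshot it
# as a new image so the NCCL downgrade persists
docker ps
docker commit <container-id> tao-toolkit:4.0.1-nccl2.11.4
```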
May 4, 2023, 5:56am
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.