I’m trying to run the example code in yolo_v4.ipynb jupyter notebook using TAO.
For this I use a Supermicro Server 4028gr-tvrt with 8x Tesla V100. When I train with one graphics card, everything works. However, if I want to train with all 8 graphics cards, I get an error.
print("To run with multigpu, please change --gpus based on the number of available GPUs in your machine.")
!tao yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet18_kitti.txt \
                   -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                   -k $KEY \
                   --gpus 8
Here is the error message and the nvidia-smi output: log.txt (169.0 KB)
Appreciate your patience. Could you please try running the training with an older version of the TAO container? With the 21.11 version we did not see this multi-GPU training error.
For yolov4, please
After that, I start jupyter notebook --no-browser --port=8080 --allow-root
Since I’m accessing remotely, I open a new terminal with the command: ssh -L 8080:localhost:8080 user@192.168.188.25
I run all the steps as described in: yolo_v4.ipynb
I get the error when trying to train with more than 1 GPU.
When I follow your approach and try to start jupyter in the docker, I get the error "Connection refused". So I mounted my previous project in the docker and ran your training command. This works with 1 GPU but not with 8: Error when training with multiple GPUs in TAO - #2 by Morganh
Then I ran the command and got the error in the output:
yolo_v4 train -e /data/notebooks/tao_launcher_starter_kit/yolo_v4/specs/yolo_v4_train_resnet18_kitti.txt -r /workspace/yolo_v4/experiment_dir_unpruned10 -k myKEY --gpus 8 output.txt (34.6 KB)
Correct. From the log, could you please delete the "visualizer" section in yolo_v4_train_resnet18_kitti.txt? The previous version does not have this feature.
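For reference, the block to remove typically looks like the sketch below. The exact field names are an assumption based on the newer spec format; delete whatever block starts with visualizer in your own spec file:

```
visualizer {
  enabled: true
  num_images: 3
}
```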
Besides the docker, is there any difference in the environment, for example the nvidia-smi output, etc.?
Actually, the way you mention is using the tao launcher. I think you have already updated the tao launcher to the latest version, so it will be calling the latest 4.0.1 docker.
It is similar to “docker run”. By using “docker run”, you can also trigger notebooks.
Unfortunately I couldn’t find any differences. Using “docker run” is not a problem. I managed to get the notebooks to run remotely in docker with port forwarding.
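For anyone hitting the same "Connection refused": the setup that worked for me can be sketched roughly as below. The image tag and the mounted host path are assumptions, adjust them to your environment; the key details are publishing the port with -p and binding Jupyter to 0.0.0.0 so it is reachable from outside the container.

```shell
# Sketch, not a verified recipe: image tag and host path are assumptions.
# --ip=0.0.0.0 is needed because Jupyter binds to localhost by default,
# which is unreachable through the published port from the host.
docker run --rm -it --gpus all \
  -p 8080:8080 \
  -v /home/user/tao-experiments:/workspace/tao-experiments \
  nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 \
  jupyter notebook --no-browser --port=8080 --allow-root --ip=0.0.0.0
```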
However, the GPU utilization worries me a bit. If I train with the standard configuration, the graphics cards are only slightly utilized. What can be the reason?
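To quantify the utilization rather than eyeballing nvidia-smi, you can sample per-GPU utilization once per second with dmon:

```shell
# Print SM and memory utilization for all GPUs once per second
# (-s u selects utilization counters, -d 1 sets a 1 s interval;
# press Ctrl+C to stop). Low, spiky sm% during training usually
# points to a data-loading bottleneck rather than the model itself.
nvidia-smi dmon -s u -d 1
```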
I would like to use TensorBoard. But that’s no longer possible because the version doesn’t support it, right? That’s why I had to delete the visualizer section, right?
There is no update from you for a period, assuming this is not an issue anymore. Hence we are closing this topic. If need further support, please reopen this topic or open a new one. Thanks
To use the 4.0.1 docker with multiple GPUs, if you have time, please install the older NCCL 2.11.4 version instead and try to run again.
Log in to the 4.0.1 docker and install it.
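Inside the container, the downgrade could be sketched as below. The exact version string, including the CUDA suffix, is an assumption; list the candidates actually available in your image first with apt-cache madison libnccl2.

```shell
# Sketch: downgrade NCCL to 2.11.4 inside the 4.0.1 container.
# The "+cuda11.6" suffix is an assumption; check the real version
# strings with: apt-cache madison libnccl2
apt-get update
apt-get install -y --allow-downgrades \
  libnccl2=2.11.4-1+cuda11.6 libnccl-dev=2.11.4-1+cuda11.6
```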