Error when training with multiple GPUs in TAO

Morganh · March 29, 2023, 1:35am

Appreciate your work. Would you please help run training with one older version of TAO container? Since in 21.11 version, we did not receive training error of multi-gpus.
For yolov4, please

$ docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3

Then login the docker, and run command without “tao”. That means,

$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3

After login inside the docker, run

yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet18_kitti.txt
-r $USER_EXPERIMENT_DIR/experiment_dir_unpruned
-k $KEY
--gpus 8

The docker is from TAO Toolkit for Computer Vision | NVIDIA NGC