Also, could you please run mpirun as well? Thanks a lot.
The 2nd experiment here.
root@debug:/workspace# mpirun --allow-run-as-root -np 2 -x HOROVOD_LOG_LEVEL=DEBUG python /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi.txt -r /workspace/results -k tlt_encode
The 3rd experiment here.
Please use the attached utils_new.py (7.5 KB) and replace the original file with it:
$ cp utils_new.py /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/tfhooks/utils.py
Run
root@debug:/workspace# detectnet_v2 train -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi_kubernetes.txt -r /workspace/tao-experiments/multilabel_test_TAO5/experiment_dir_unpruned_visualizer_lowbatch_reducedTfrecord_NoAutoResize/ -n resnet34_detector --gpus 2 -k tlt_encode
or
root@debug:/workspace# mpirun --allow-run-as-root -np 2 -x HOROVOD_LOG_LEVEL=DEBUG python /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi.txt -r /workspace/results -k tlt_encode
Hi,
Luckily, after leasing a new machine with 2x A100, I can now reproduce the issue when --use_amp is used in 2-GPU training. When --use_amp is not used, the issue is gone.
Summary:
- visualizer.enable: true + --use_amp on the command line ==> issue happens
- visualizer.enable: true, without --use_amp ==> issue does not happen
- visualizer.enable: false + --use_amp on the command line ==> issue does not happen
- visualizer.enable: false, without --use_amp ==> issue does not happen
Glad to know that!
I hope I can do the tests during the daytime.
Morganh:
Also, could you please run mpirun as well? Thanks a lot.
Should I try these tests? Or do you have a better clue about where the issue comes from?
The issue happens when amp is enabled and visualizer.enable is true.
We’re looking into it.
As a workaround, please disable AMP or disable the visualizer during training.
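For reference, disabling the visualizer is done in the training spec. A sketch, assuming the standard DetectNet_v2 spec schema (verify the field name against your own spec file):

```
visualizer {
  enabled: false
}
```

Disabling AMP only requires dropping the --use_amp flag from the train command.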
Any news? I don’t want the post to auto-close.
Hi,
The TAO team is still working on the issue that occurs when AMP is enabled and visualizer.enable is true. We will update you when there is progress. Thanks.
Morganh
September 12, 2023, 3:39am
71
Hi,
Please delete the two lines below in the above file; the issue will be gone.
868 else:
869 visualizer_config.enabled = False
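If it helps, the deletion can be scripted. This is a sketch: the helper below is hypothetical, and the line numbers 868 and 869 are assumed to still match your copy of utils.py (check with `sed -n '868,869p' <file>` first):

```shell
# delete_lines FILE START END -- remove 1-indexed lines START..END in place,
# keeping a FILE.bak backup first.
delete_lines() {
  cp "$1" "$1.bak"
  sed -i "$2,$3d" "$1"
}

# For the fix above (path as in the TAO 5 container):
# delete_lines /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/tfhooks/utils.py 868 869
```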
I will try to test this today. Should I use the default files, or do I need to replace some of them (utils_new.py, detectnet_model.py, …)?
Morganh
September 13, 2023, 10:28am
73
Please use the default files and delete the above-mentioned two lines.
Yes, using the attached file and deleting the two lines works!
How can I implement that in the cluster? :) Or will you release a minor update?
Morganh
September 19, 2023, 1:40pm
75
There is no minor update planned. You can commit a new docker image, as we synced previously in another topic.
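For persisting the patch in the cluster, committing the patched container to a new image is one route. A sketch with placeholder names (the container ID, registry, and tag are all hypothetical, substitute your own):

```
# Find the running TAO container, then commit and push it under a new tag.
docker ps
docker commit <container_id> myregistry/tao-tf1:5.0-visualizer-fix
docker push myregistry/tao-tf1:5.0-visualizer-fix
```

Your Kubernetes job or pod spec can then reference the new image tag.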