TAO5 - Detectnet_v2 - MultiGPU TAO API Stuck

Also, could you please run mpirun as well? Thanks a lot.
The 2nd experiment here.
root@debug:/workspace# mpirun --allow-run-as-root -np 2 -x HOROVOD_LOG_LEVEL=DEBUG python /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi.txt -r /workspace/results -k tlt_encode

The 3rd experiment here.
Please use the attached utils_new.py (7.5 KB) to replace the existing utils.py:

$ cp utils_new.py /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/tfhooks/utils.py
Then run:

root@debug:/workspace# detectnet_v2 train -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi_kubernetes.txt -r /workspace/tao-experiments/multilabel_test_TAO5/experiment_dir_unpruned_visualizer_lowbatch_reducedTfrecord_NoAutoResize/ -n resnet34_detector --gpus 2 -k tlt_encode

or

root@debug:/workspace# mpirun --allow-run-as-root -np 2 -x HOROVOD_LOG_LEVEL=DEBUG  python /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi.txt -r /workspace/results -k tlt_encode

Hi,
Luckily, after leasing a new machine with 2x A100, I can now reproduce the issue when --use_amp is used in 2-GPU training.

Without --use_amp, the issue is gone.

Summary:

  1. visualizer.enable:true + use_amp in commandline ==> issue happens
  2. visualizer.enable:true + not use_amp in commandline ==> issue does not happen
  3. visualizer.enable:false + use_amp in commandline ==> issue does not happen
  4. visualizer.enable:false + not use_amp in commandline ==> issue does not happen
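
For quick reference, the four cases above can be written as a tiny truth table. This is only a sketch of the observed behaviour on my 2-GPU setup, not anything taken from the TAO code; the function name `hang_expected` is made up for illustration.

```python
from itertools import product


def hang_expected(visualizer_enabled: bool, use_amp: bool) -> bool:
    """Return True for the one combination observed to hang in the summary above.

    Hypothetical helper: it only encodes the four reported test results.
    """
    return visualizer_enabled and use_amp


# Print all four combinations in the same order as the summary list.
for visualizer_enabled, use_amp in product([True, False], repeat=2):
    outcome = "issue happens" if hang_expected(visualizer_enabled, use_amp) else "issue does not happen"
    print(f"visualizer enabled={visualizer_enabled}, use_amp={use_amp} ==> {outcome}")
```

In other words, the hang only shows up when both options are on at the same time.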

Glad to know that!

I hope I can do the tests during the daytime.

Should I run these tests? Or do you have a better idea of where the issue comes from?

The issue happens when amp is enabled and visualizer.enable is true.
We’re looking into it.
As a workaround, please disable AMP or disable the visualizer during training.


Any news? I don’t want the post to auto-close.

Hi,
The TAO team is still working on the issue that occurs when AMP is enabled and visualizer.enable is true. We will update you when there is news. Thanks.


Hi,
Please delete the two lines below from the above file. The issue will then be gone.

868    else:
869        visualizer_config.enabled = False
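
If you would rather script the edit than do it by hand (for example when building a patched image), something like the following might work. It is only a sketch: it matches the two lines by content instead of by line number, since the numbers may differ between versions, and the function names and the `path` argument are placeholders, not part of TAO. Please check the result before training with it.

```python
from pathlib import Path


def drop_visualizer_else(source: str) -> str:
    """Remove every `else:` line that is immediately followed by
    `visualizer_config.enabled = False`, together with that following line."""
    lines = source.splitlines(keepends=True)
    kept = []
    i = 0
    while i < len(lines):
        is_pair = (
            lines[i].strip() == "else:"
            and i + 1 < len(lines)
            and lines[i + 1].strip() == "visualizer_config.enabled = False"
        )
        if is_pair:
            i += 2  # skip both lines of the pair
            continue
        kept.append(lines[i])
        i += 1
    return "".join(kept)


def patch_file(path: str) -> None:
    """Apply the edit in place; `path` is whichever file you were told to modify."""
    target = Path(path)
    target.write_text(drop_visualizer_else(target.read_text()))


# Small self-contained demonstration on an in-memory snippet.
sample = (
    "if flag:\n"
    "    visualizer_config.enabled = True\n"
    "else:\n"
    "    visualizer_config.enabled = False\n"
    "print('done')\n"
)
print(drop_visualizer_else(sample))
```

Any other `else:` branch is left alone, because the pair is only dropped when the second line matches exactly.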

I’ll try to test this today.

Should I use the default files, or do I need to replace some of them (utils_new.py, detectnet_model.py, …)?

Please use the default files and delete only the two lines mentioned above.


Yes, using the file attached and deleting the two lines works!

How can I implement that in the cluster? :) Or will you release a minor update?

There is no minor update planned. You can commit a new Docker image, as we discussed previously in another topic.