TAO5 - Detectnet_v2 - MultiGPU TAO API Stuck

Also, could you please run with mpirun as well? Thanks a lot.
The 2nd experiment:
root@debug:/workspace# mpirun --allow-run-as-root -np 2 -x HOROVOD_LOG_LEVEL=DEBUG python /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi.txt -r /workspace/results -k tlt_encode

The 3rd experiment:
Please use the attached utils_new.py (7.5 KB) and replace the existing file:
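(Optional extra step, not in the original instructions: back up the stock utils.py first so the change can be reverted.)
$ cp /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/tfhooks/utils.py /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/tfhooks/utils.py.bak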

$ cp utils_new.py /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/tfhooks/utils.py
Then run:

root@debug:/workspace# detectnet_v2 train -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi_kubernetes.txt -r /workspace/tao-experiments/multilabel_test_TAO5/experiment_dir_unpruned_visualizer_lowbatch_reducedTfrecord_NoAutoResize/ -n resnet34_detector --gpus 2 -k tlt_encode

or

root@debug:/workspace# mpirun --allow-run-as-root -np 2 -x HOROVOD_LOG_LEVEL=DEBUG  python /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py -e /workspace/tao-experiments/specs/detectnet_v2_train_peoplenet_kitti_multi.txt -r /workspace/results -k tlt_encode

Hi,
Luckily, after getting a new machine with 2x A100, I can now reproduce the issue when --use_amp is used in 2-GPU training.

Without --use_amp, the issue is gone.

Summary (the snippet after this list shows where each switch lives):

  1. visualizer.enabled: true + --use_amp on the command line ==> issue happens
  2. visualizer.enabled: true + no --use_amp ==> issue does not happen
  3. visualizer.enabled: false + --use_amp on the command line ==> issue does not happen
  4. visualizer.enabled: false + no --use_amp ==> issue does not happen
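For reference, an illustrative sketch of where each switch lives (paths and spec names below are placeholders, not taken from this thread's files): AMP is toggled with the --use_amp command-line flag, while the visualizer is toggled in the training spec.

$ detectnet_v2 train -e <spec>.txt -r <results_dir> -k tlt_encode --gpus 2 --use_amp    (AMP enabled)
$ detectnet_v2 train -e <spec>.txt -r <results_dir> -k tlt_encode --gpus 2              (AMP disabled)

In the spec (protobuf text format), the visualizer block typically looks like:

visualizer {
  enabled: true
}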

Glad to know that!

I hope I can do the tests during the daytime.

Should I try these tests? Or do you have a better clue about where the issue comes from?

The issue happens when AMP is enabled and visualizer.enabled is true.
We’re looking into it.
As a workaround, please disable AMP or disable the visualizer during training.


Any news? I don’t want the post to auto-close.

Hi,
The TAO team is still working on the issue that occurs when AMP is enabled and visualizer.enabled is true. We will update you when there is news. Thanks.


Hi,
Please delete the two lines below in the above file. The issue will be gone.

868    else:
869        visualizer_config.enabled = False

I will try to test this point today.

Should I use the default files, or do I need to replace some of them (utils_new.py, detectnet_model.py, …)?

Please use the default files and apply the above-mentioned two-line deletion.


Yes, using the attached file and deleting the two lines works!

How can I implement that in the cluster? :) Or will you release a minor update?

There is no minor update. You can commit a new docker image, as we synced previously in another topic.

Can you add that information to this topic so all the steps are in a single post, or link to the correct one?
Thanks.

Steps:

  1. (Optional) Install docker if it is not available on your machine.
$ sudo apt-get update
$ sudo apt install docker.io
$ sudo chown local-morganh:docker /var/run/docker.sock
$ sudo usermod -a -G docker local-morganh
  2. Launch the tao-toolkit:5.0.0-tf1.15.5 docker container.
$ docker run -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash

Make the modification to comment out lines 860 and 861:
$ vim /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py

 860     #else:
 861     #    visualizer_config.enabled = False
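(Optional check, not in the original steps: confirm the edit took effect before committing, e.g. by printing the surrounding lines.)
$ sed -n '855,865p' /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py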
  3. Open another terminal to generate a new docker image.
$ docker ps    (to check the container ID)
Assume it is 0215b8997946, then
$ docker commit 0215b8997946 nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5-fix
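(Optional check: confirm the new tag now exists locally.)
$ docker images | grep 5.0.0-tf1.15.5-fix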
  4. Uninstall tao-api.
$ helm ls
$ helm delete tao-toolkit-api
  5. Download tao-api.
$ helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-5.0.0.tgz --username='$oauthtoken' --password=<YOUR API KEY>
$ mkdir tao-toolkit-api && tar -zxvf tao-toolkit-api-5.0.0.tgz -C tao-toolkit-api
$ cd tao-toolkit-api
$ vim tao-toolkit-api/values.yaml
Change the two image lines as below.
 32 imageTf1: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5-fix
 37 imageDefault: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5-fix
  6. Update tao-toolkit-api.
$ helm install tao-toolkit-api tao-toolkit-api/ --namespace default
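(If the release was not deleted first, an equivalent idempotent form is helm upgrade --install.)
$ helm upgrade --install tao-toolkit-api tao-toolkit-api/ --namespace default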
  7. Check the update.
$ kubectl get pods
$ kubectl describe pod tao-toolkit-api-app-pod-64dfd55495-kg55s

      IMAGE_TF1:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5-fix
      IMAGE_PYT:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt
      IMAGE_TF2:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf2.11.0
      IMAGE_TAO_DEPLOY:     nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy
      IMAGE_DEFAULT:        nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5-fix
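(A quick way to list only the image variables; the pod name above is an example and will differ in your cluster.)
$ kubectl describe pod <tao-toolkit-api-app-pod> | grep IMAGE_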


With this method, when TAO creates a new pod to run a job, it crashes.

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  70s                default-scheduler  Successfully assigned default/04a10067-dbec-4d96-b94b-735b26f9f7db-ggwl5 to azken
  Normal   Pulling    25s (x3 over 69s)  kubelet            Pulling image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5-fix"
  Warning  Failed     23s (x3 over 67s)  kubelet            Failed to pull image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5-fix": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5-fix": failed to resolve reference "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5-fix": nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5-fix: not found
  Warning  Failed     23s (x3 over 67s)  kubelet            Error: ErrImagePull
  Normal   BackOff    9s (x3 over 66s)   kubelet            Back-off pulling image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5-fix"
  Warning  Failed     9s (x3 over 66s)   kubelet            Error: ImagePullBackOff

Did you run docker commit to generate the docker nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5-fix?
Could you share the history of the commands?

I did the commit, and all the steps that you posted.

But TAO uses containerd by default, not docker. So we need to add the following steps to fix the issue:

docker save -o tao-tf1.tar nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5-fix
sudo ctr -n=k8s.io image import tao-tf1.tar
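(Optional check, assuming the k8s.io namespace used above: confirm containerd now sees the image.)
$ sudo ctr -n=k8s.io image ls | grep 5.0.0-tf1.15.5-fix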

After that, everything works.


I have bad news.

Today, in a new training run, the initial problem returned…

This is with the fix applied.

    "gpus": 2,

        "visualizer": {
            "enabled": true,

and the training log ends with:

Missing ranks:
0: [DistributedAdamOptimizer_Allreduce/cond/HorovodAllreduce_mul_129_0, DistributedAdamOptimizer_Allreduce/cond_1/HorovodAllreduce_mul_130_0, DistributedAdamOptimizer_Allreduce/cond_10/HorovodAllreduce_mul_139_0, DistributedAdamOptimizer_Allreduce/cond_100/HorovodAllreduce_mul_229_0, DistributedAdamOptimizer_Allreduce/cond_101/HorovodAllreduce_mul_230_0, DistributedAdamOptimizer_Allreduce/cond_102/HorovodAllreduce_mul_231_0 ...]
1: [DistributedAdamOptimizer_Allreduce/cond/HorovodAllreduce_mul_139_0, DistributedAdamOptimizer_Allreduce/cond_1/HorovodAllreduce_mul_140_0, DistributedAdamOptimizer_Allreduce/cond_10/HorovodAllreduce_mul_149_0, DistributedAdamOptimizer_Allreduce/cond_100/HorovodAllreduce_mul_239_0, DistributedAdamOptimizer_Allreduce/cond_101/HorovodAllreduce_mul_240_0, DistributedAdamOptimizer_Allreduce/cond_102/HorovodAllreduce_mul_241_0 ...]

This is an exhausting topic…

You need to re-check the modified file.

I suggest you double-check that you actually modified the code.

You can also use the standalone debug pod to verify.
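One quick way to confirm the commented-out lines actually made it into the -fix image (an illustrative check using docker directly, not from the original thread):
$ docker run --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5-fix sed -n '855,865p' /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py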
Also, I recall that you verified it previously, as below.

Yeah, you are right.
For some reason, the changes were not included in the tf1.15.5-fix image, even though it was correctly linked in the TAO Toolkit.

I redid it and started the training again.

Thanks again.