TAO API - Detectnet_v2 - Activation option --use_amp

Please provide the following information when requesting support.

• Hardware: RTX4090
• Network Type: Detectnet_v2
• TLT Version 4.0.2
• How to reproduce the issue ?

Is there any way to indicate in the training/retraining process that the "--use_amp" option should be used? I can't find it in the API protocol, it is not mentioned in the Client App, and in the spec files it only appears for "mask_rcnn".

If not, can it be modified manually in the pod code?

Thanks in advance

You can refer to DetectNet_v2 - NVIDIA Docs. The detectnet_v2 supports running with use_amp. For TAO API, could you go inside the workflow pod:

$ kubectl exec -it tao-toolkit-api-workflow-pod-xxxxx-yyyy -- /bin/bash

Then edit /opt/handlers/network_configs/detectnet_v2.config.json (for example with vim) to add "use_amp":"true" in cli_params->train, and try again?


That’s the answer I need!

Thank you so much. I'll ping you once I have tested it successfully.

Hi,
Please use the approach below instead.

Set the following in the notebook:

specs["use_amp"] = True

and then go inside the workflow pod:

$ kubectl exec -it tao-toolkit-api-workflow-pod-xxxxx-yyyy -- /bin/bash

Then edit /opt/api/handlers/network_configs/detectnet_v2.config.json (for example with vim) to add "use_amp":"from_csv" in cli_params->train.

Please let me know if it works. Thanks.
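If you prefer not to edit the file by hand, the config change described above can be sketched in Python. This is only a rough sketch, not an official TAO tool: it assumes the config is plain JSON with a top-level "cli_params" object containing a "train" object (as the cli_params->train wording suggests), and that a Python interpreter is available inside the workflow pod. The function names are made up for illustration.

```python
import json
from pathlib import Path


def set_train_cli_param(config: dict, key: str, value: str) -> dict:
    """Return a copy of `config` with cli_params -> train -> key set to value.

    Assumes the layout {"cli_params": {"train": {...}}}; missing levels
    are created rather than treated as an error.
    """
    updated = json.loads(json.dumps(config))  # deep copy via JSON round-trip
    train_params = updated.setdefault("cli_params", {}).setdefault("train", {})
    train_params[key] = value
    return updated


def patch_config_file(path: str, key: str = "use_amp", value: str = "from_csv") -> None:
    """Hypothetical helper: rewrite the JSON file at `path` in place."""
    config_path = Path(path)
    config = json.loads(config_path.read_text())
    patched = set_train_cli_param(config, key, value)
    config_path.write_text(json.dumps(patched, indent=4))
```

Inside the pod, this would be called as patch_config_file("/opt/api/handlers/network_configs/detectnet_v2.config.json"). Keeping a backup copy of the original file first is a good idea, since the file is overwritten.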


Well, the option is passed into the system, but the training does not launch.

All the processes start and the GPU VRAM gets loaded, but it freezes at the start of the training.

INFO:tensorflow:Graph was finalized.
2023-05-25 15:17:29,549 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-05-25 15:17:32,472 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-05-25 15:17:33,076 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-05-25 15:17:43,655 [INFO] tensorflow: Saving checkpoints for step-0.

I removed the option from the train specs and it launched correctly.

I hid this information so as not to confuse readers. I have a mix of other problems apart from that.

Yes, it is working!

This is the process line from "nvtop", with --use_amp at the end. And the new frame rate is higher than before!

424MiB   1%     0%    676MiB /usr/bin/python3.6 /usr/local/bin/detectnet_v2 train --gpus 1 --experiment_spec_file /shared/users/xx/models/67c90d9b-19e8-41c7-8baf-30c0e8a3f8e3/specs/161c6ee5-4231-4581-8f91-351cd34af5f2.yaml --results_dir /shared/users/xx/models/67c90d9b-19e8-41c7-8baf-30c0e8a3f8e3/161c6ee5-4231-4581-8f91-351cd34af5f2 --verbose --key tlt_encode --use_amp
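For anyone who wants to check this without reading the line by eye, here is a small Python sketch that tests whether a flag is present in a command line copied from nvtop or ps. The helper name is made up, and it only checks that the flag was passed to the process, not that mixed precision is actually active.

```python
import shlex


def has_flag(command_line: str, flag: str = "--use_amp") -> bool:
    """Return True if `flag` appears as a separate argument in the command line."""
    return flag in shlex.split(command_line)


# Shortened version of the command above; paths are elided here for brevity.
cmd = "/usr/bin/python3.6 /usr/local/bin/detectnet_v2 train --gpus 1 --verbose --key tlt_encode --use_amp"
print(has_flag(cmd))
```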

I'm deploying new multi-GPU equipment, and I'm mixing up several problems. :)

Thank you!
