Please provide the following information when requesting support.
• Hardware: RTX4090
• Network Type: Detectnet_v2
• TLT Version 4.0.2
• How to reproduce the issue ?
Is there any way to specify the "--use_amp" option in the training/retraining process? I can't find it in the API protocol, it is not mentioned in the Client App, and in the spec files it only appears for "mask_rcnn".
If not, can it be modified manually in the pod code?
Thanks in advance
You can refer to DetectNet_v2 - NVIDIA Docs. detectnet_v2 supports running with use_amp. For the TAO API, could you go inside the workflow pod:
$ kubectl exec -it tao-toolkit-api-workflow-pod-xxxxx-yyyy -- /bin/bash
Use vim to edit /opt/handlers/network_configs/detectnet_v2.config.json and add "use_amp":"true"
under cli_params->train,
then try again?
That’s the answer I need!
Thank you so much. I'll ping you once I have tested it successfully.
Hi,
Please use the approach below instead.
Set the following in the notebook:
specs["use_amp"] = True
and then go inside the workflow pod:
$ kubectl exec -it tao-toolkit-api-workflow-pod-xxxxx-yyyy -- /bin/bash
Use vim to edit /opt/handlers/network_configs/detectnet_v2.config.json and add "use_amp":"from_csv"
under cli_params->train.
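If you prefer not to edit the file by hand, the same change can be sketched in Python. This is only a minimal sketch: the helper name set_train_cli_param is hypothetical, and it assumes the config is plain JSON with a top-level "cli_params" object containing a "train" object, as the cli_params->train path suggests. It is demonstrated on a throwaway file; inside the pod you would point it at the real config path instead.

```python
import json
import tempfile
from pathlib import Path


def set_train_cli_param(config_path, key, value):
    """Add or overwrite one entry under cli_params -> train in a JSON config.

    Hypothetical helper: assumes the layout {"cli_params": {"train": {...}}}.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    # Create the nested objects if they are missing, then set the entry.
    train_params = config.setdefault("cli_params", {}).setdefault("train", {})
    train_params[key] = value
    path.write_text(json.dumps(config, indent=4) + "\n")
    return config


# Demonstrate on a temporary file rather than the real config in the pod.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "detectnet_v2.config.json"
    demo_path.write_text(json.dumps({"cli_params": {"train": {}}}))
    updated = set_train_cli_param(demo_path, "use_amp", "from_csv")
    print(updated["cli_params"]["train"])
```

Keeping a backup copy of the original config before changing it is a sensible precaution, since the workflow pod reads this file for every train job.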
Please let me know if it works. Thanks.
Morganh:
specs["use_amp"]=True
Well, the option is passed into the system, but the training does not launch.
The whole process starts and the GPU VRAM gets loaded, but it freezes at the start of training.
INFO:tensorflow:Graph was finalized.
2023-05-25 15:17:29,549 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-05-25 15:17:32,472 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-05-25 15:17:33,076 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-05-25 15:17:43,655 [INFO] tensorflow: Saving checkpoints for step-0.
I removed the option from the train specs and it launched correctly.
I hid this information so as not to confuse readers. I have a mix of other problems apart from that.
Morganh:
"use_amp":"from_csv"
Yes, it is working!
This is the process entry shown in "nvtop", with --use_amp at the end of the command. And the new frame rate is higher than before!
424MiB 1% 0% 676MiB /usr/bin/python3.6 /usr/local/bin/detectnet_v2 train --gpus 1 --experiment_spec_file /shared/users/xx/models/67c90d9b-19e8-41c7-8baf-30c0e8a3f8e3/specs/161c6ee5-4231-4581-8f91-351cd34af5f2.yaml --results_dir /shared/users/xx/models/67c90d9b-19e8-41c7-8baf-30c0e8a3f8e3/161c6ee5-4231-4581-8f91-351cd34af5f2 --verbose --key tlt_encode --use_amp
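A quick way to confirm the flag programmatically, rather than by eye, is to tokenize the command line and look for it. The sketch below is illustrative only: has_flag is a hypothetical helper, and the command string is a shortened copy of the line above with the long paths left out.

```python
import shlex


def has_flag(command_line, flag):
    """Return True if `flag` appears as a separate argument in the command line."""
    return flag in shlex.split(command_line)


# Shortened version of the process line above (paths omitted for brevity).
cmd = (
    "/usr/bin/python3.6 /usr/local/bin/detectnet_v2 train "
    "--gpus 1 --verbose --key tlt_encode --use_amp"
)
print(has_flag(cmd, "--use_amp"))  # prints True
```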
I'm deploying new multi-GPU equipment, and I'm mixing up several problems. :)
Thank you!