Tao yolov4 pruned model is stuck at 6.5 FPS

• Hardware : A5000
• Network Type: Yolo_v4
• TLT Version: 3.22.05
• Training spec file :
d26_yolov4_apm_apr1924_pruned_retrain_v5.txt (3.0 KB)

The model was trained with the above spec file. After exporting it and converting it to an engine file with INT8 precision and batch size 1, testing with trtexec gives a maximum of 6.5 FPS. Log details were already shared in a previous thread: Low FPS for pruned tao toolkit models on deepstream - #30 by Fiona.Chen

The pruning ratio for the model is 0.57.

How do I train a YOLOv4 model with the TAO Toolkit that will give me 15 FPS in trtexec?

From your log in comment Low FPS for pruned tao toolkit models on deepstream - #16 by adithya.ajith,

[08/01/2024-12:52:30] [I] GPU Compute Time: min = 4.38373 ms, max = 4.71448 ms, mean = 4.43574 ms, median = 4.43896 ms, percentile(90%) = 4.44519 ms, percentile(95%) = 4.46667 ms, percentile(99%) = 4.47385 ms

That works out to about 1000 / 4.43574 ≈ 225 FPS.

May I know the log for the 6.5 FPS result?

Please look at this in the context of my forum question above: I am looking to achieve 15 FPS for 30 cameras. This means the sum of my compute, D2H, and H2D latencies would have to come down by more than half from the current value of roughly 5.45 ms.
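To make the latency budget concrete, here is a quick sketch of the arithmetic; the 4.43574 ms mean comes from the trtexec log above, and the 30-camera / 15 FPS target from this post:

```shell
# Throughput implied by trtexec's mean GPU compute time (batch size 1)
mean_ms=4.43574
fps=$(awk "BEGIN { printf \"%d\", 1000 / $mean_ms }")
echo "single-stream throughput: ${fps} FPS"

# Per-frame latency budget to serve 30 cameras at 15 FPS each
# on one GPU with batch size 1
cams=30
target_fps=15
budget_ms=$(awk "BEGIN { printf \"%.2f\", 1000 / ($cams * $target_fps) }")
echo "required per-frame latency: ${budget_ms} ms"
```

Note that 30 cameras × 15 FPS is 450 inferences per second, i.e. a per-frame budget of about 2.22 ms at batch size 1, which is indeed less than half of 5.45 ms. Batching multiple streams per inference would relax this budget.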

I am looking to bring down the inference time for the model.

I suggest you use the YOLO_v4_tiny network.

I cannot change the network architecture or the input size because of accuracy constraints. Are there any other options?

Actually, YOLO_v4_tiny just changes to another backbone compared to YOLO_v4.
You can set up similar experiments to run training and check the mAP result.

Before moving on to a different model, can you tell me if there is any way to extract more FPS out of a YOLOv4 model, since it is a tried and tested model in terms of accuracy for my use case?

  • I am specifically asking in terms of changes to the training config. For example, can you recommend a backbone that is lighter than resnet18 but comparable in terms of feature extraction?
  • A different pruning approach from my current one (command used: tao yolo_v4 prune -m <model-path> -o <output-path> -k <key> -e <path-to-training-config> -pth 0.5 -eq intersection).
  • Any changes recommended while exporting the model (command used: tao yolo_v4 export -m <model-path> -o <path-to-.etlt-file> -k <key> --data_type int8 -e <path-to-training-config> --cal_cache_file <path-to-cal.bin-file>).
  • Any recommendations for engine file creation; currently the model’s engine file is created when the DeepStream (v6.3) pipeline starts.

Also, about YOLOv4_tiny: are you saying that it has the same architecture as YOLOv4 and the only difference is the backbones supported?

One important thing concerns the mAP. You mention you are using TAO 3.22.05. Can you use a newer version of the TAO docker to train? As mentioned in another topic, TAO 5.0 (or 4.0.0 or 4.0.1) can improve the mAP by fixing issues in the YOLOv4 structure, the loss function, etc.
For the backbone, you can run experiments with mobilenet_v1 or mobilenet_v2.
For pruning: after pruning, you need to retrain the pruned model to retain a similar mAP. You can prune a bit → retrain → prune a bit → retrain → and so on.
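The prune-a-bit → retrain loop could be sketched as a shell script. The checkpoint names, the key/spec placeholders, and the 0.1 → 0.2 → 0.3 threshold schedule below are all hypothetical, not recommended values:

```shell
#!/bin/sh
# Iterative prune -> retrain loop (sketch only).
# KEY, SPEC, and all file names are hypothetical placeholders.
KEY=<key>
SPEC=<path-to-training-config>
MODEL=<path-to-unpruned-checkpoint>.tlt

for PTH in 0.1 0.2 0.3; do            # gentle, increasing threshold schedule
  PRUNED="pruned_pth_${PTH}.tlt"
  tao yolo_v4 prune -m "$MODEL" -o "$PRUNED" -k "$KEY" -e "$SPEC" \
      -pth "$PTH" -eq intersection
  # Point pruned_model_path in the retrain spec at $PRUNED, then retrain
  # to recover mAP before pruning further.
  tao yolo_v4 train -e "$SPEC" -r "results_pth_${PTH}" -k "$KEY"
  # Continue from the retrained checkpoint on the next iteration.
  MODEL="results_pth_${PTH}/weights/<retrained-checkpoint>.tlt"
done
```

The idea is that each small pruning step removes less capacity at once, so each retraining pass has an easier time recovering the mAP than a single aggressive prune would.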
For exporting and engine generation, I suggest you use TAO 5.0. It exports to an ONNX file, and then you can run trtexec to generate the TensorRT engine.
For the last question: yes, it is.
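The export-then-trtexec path could look like the following; the file names are placeholders, and the --calib flag assumes the INT8 calibration cache produced during export:

```shell
# Build an INT8, batch-size-1 TensorRT engine from the exported ONNX model.
# File names are hypothetical placeholders.
trtexec --onnx=yolov4_resnet18.onnx \
        --int8 \
        --calib=cal.bin \
        --saveEngine=yolov4_resnet18_int8.engine

# Benchmark the standalone engine afterwards (reports latency and throughput):
trtexec --loadEngine=yolov4_resnet18_int8.engine
```

Building the engine offline this way also avoids the engine-build step at DeepStream pipeline startup, since the pipeline can load the pre-built .engine file directly.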

I am happy with the mAP of the model (trained on 3.22.05). I don’t want to migrate to 5.x just for better accuracy, but migrating would make sense if the overall upgrades you mentioned, such as the change in the YOLOv4 structure, also result in performance improvements for the model.

Also, we experimented with TAO 5.x, but it has an issue with the validation TFRecords it generates, which in turn results in wrong mAP calculation. A team member of mine discovered this and has already raised (or is planning to raise) the issue on the forums.

Does using trtexec to generate the TensorRT engine, compared to generating it in the DeepStream pipeline, give a performance bump?

The points regarding the backbones and pruning make sense and are something I can start experimenting with.

No, it does not mean improving mAP. It is just another way to generate the engine instead of using DeepStream. One more option is to decode the .etlt model to an ONNX file; refer to tao_toolkit_recipes/tao_forum_faq/FAQ.md at main · NVIDIA-AI-IOT/tao_toolkit_recipes · GitHub. Then you can also use trtexec.

To clear up the confusion: I am talking about a performance (FPS) bump, not mAP.

What can you tell me about my first question regarding performance improvement in 5.x vs 3.x?

For 5.x vs 3.x in YOLO_v4, the improvements focus on mAP only.


Given the large input size for my model (1888×1056), will MobileNet_v2, which is a smaller backbone compared to ResNets, be able to extract features properly at all 3 scales? Detection of small and medium-sized objects is very important for the use case where the model will be deployed. If you think my point above is valid, does it make sense to start my training experiments with the yolov4_tiny model instead?
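For context on what "all 3 scales" means at that input size, the feature-map grids can be worked out directly; the strides 8/16/32 below are the usual YOLOv4 head strides (an assumption about this config, not taken from the spec file):

```shell
# Feature-map grid sizes at the three YOLOv4 detection scales
# for a 1888x1056 input, assuming the usual head strides of 8, 16, 32.
W=1888; H=1056
for S in 8 16 32; do
  echo "stride $S: $((W / S)) x $((H / S))"
done
```

At stride 8 this gives a 236×132 grid, which is the scale that matters most for the small objects mentioned above; the backbone choice affects feature quality at these grids, not their resolution.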

For yolo_v4_tiny, I cannot find the backbone cspdarknet_tiny_3l in NGC. Is there any other source for this model?

There is one pretrained model, “cspdarknet_tiny.hdf5”, in NGC. For the cspdarknet_tiny_3l backbone, you can use cspdarknet_tiny.hdf5 as the pretrained model.

May I know your thoughts on my question about MobileNet_v2?

For mobilenet_v2, you can run the training to see if it achieves a competitive mAP.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.