Low FPS for FRCNN model

• Hardware Platform (Jetson / GPU): Jetson TX2
• DeepStream Version: 5.1
• JetPack Version (valid for Jetson only): 4.5.1
• TensorRT Version: 7.1.3

Hi, currently I have two trained object detection models (YOLOv4 and FRCNN), both from TLT 3.0. I managed to export both models in FP32 and to run both in the DeepStream SDK. However, the problem is that the average FPS for the FRCNN model is around 1.5, while the average FPS for the YOLOv4 model is around 14. The input size for the YOLOv4 model is 544x960 and the input size for the FRCNN model is 540x960. I was wondering if this is common, or whether there is an issue with my work. If it is an issue, can you help advise on how to solve it? Here are my code, engine file and configuration settings if required…
pgie_frcnn_tlt_config.txt (2.1 KB)
frcnn_resnet_18.epoch74_fp32.etlt_b1_gpu0_fp32.engine (72.3 MB)
deepstream_frcnn_drone_v5.py (12.7 KB)

I couldn't upload my encoded video file as the file size is above 100 MB.

Hi,

Would you mind uploading the data to an online drive and sharing the link with us?

Please note that we have a newer DeepStream 6.0 package.
It’s always recommended to use our latest software for the best performance.

Here are two things we would like to check with you first.
1. Have you maximized the device performance?

$ sudo nvpmodel -m 2
$ sudo jetson_clocks

2. Could you double-check whether the TensorRT engine is re-generated on every launch, or whether it de-serializes the cached engine directly? (A sketch of the relevant config keys is shown below.)
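
For reference, this is roughly how a pre-built engine is usually referenced in the Gst-nvinfer configuration; the file names below are simply taken from the attachments above and may differ in your setup:

[property]
tlt-model-key=nvidia_tlt
tlt-encoded-model=frcnn_resnet_18.epoch74_fp32.etlt
model-engine-file=frcnn_resnet_18.epoch74_fp32.etlt_b1_gpu0_fp32.engine

If model-engine-file points to an existing serialized engine, nvinfer de-serializes it directly and you should not see the long engine-building step at start-up.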

Thanks.

Hello, thank you for your reply. I am unable to move to DeepStream 6.0 as the deadline for this project is coming to an end. Yes, I did try to maximize the device performance, but the FPS for the FRCNN model remains unchanged. As for the TensorRT engine, it is not re-generated on every launch, since I commented out the tlt-encoded-model variable in the configuration file. Here is the link to the encoded video file: https://drive.google.com/drive/folders/1FbgJQHZNzawtbMUTj9EQ1XsvwKdnLroR?usp=sharing
Thank you so much for your help

Hi,

We want to reproduce this issue in our environment to check it in-depth.
Would you mind sharing the frcnn_resnet_18.epoch74_fp32.etlt model with us as well?

Thanks.

Here is the exported model
frcnn_resnet_18.epoch74_fp32.etlt (47.4 MB)

Hi,

We saw that you have improved the performance of the YOLOv4 case.
Does the fix also work for the FRCNN case?

Thanks.

The fix doesn't work for the FRCNN model. I have tried running the FRCNN model on a .h264 video file instead of a live recording, but the processing rate still remains at ~1 FPS.

Hi,

We ran your model with TensorRT through tao-converter.

$ ./tao-converter -k nvidia_tlt -d 3,540,960 frcnn_resnet_18.epoch74_fp32.etlt -t fp32 -e tmp.engine
$ /usr/src/tensorrt/bin/trtexec --loadEngine=tmp.engine

The model is relatively complicated, and it takes around 489 ms to finish one inference.
So the ~1 FPS performance sounds reasonable once all the components in the pipeline are included.

[12/28/2021-08:00:34] [I] === Performance summary ===
[12/28/2021-08:00:34] [I] Throughput: 2.04354 qps
[12/28/2021-08:00:34] [I] Latency: min = 485.931 ms, max = 496.887 ms, mean = 489.338 ms, median = 488.046 ms, percentile(99%) = 496.887 ms
[12/28/2021-08:00:34] [I] End-to-End Host Latency: min = 485.94 ms, max = 496.895 ms, mean = 489.346 ms, median = 488.053 ms, percentile(99%) = 496.895 ms
[12/28/2021-08:00:34] [I] Enqueue Time: min = 2.13623 ms, max = 2.2998 ms, mean = 2.22013 ms, median = 2.21877 ms, percentile(99%) = 2.2998 ms
[12/28/2021-08:00:34] [I] H2D Latency: min = 0.301758 ms, max = 0.308838 ms, mean = 0.305035 ms, median = 0.305237 ms, percentile(99%) = 0.308838 ms
[12/28/2021-08:00:34] [I] GPU Compute Time: min = 485.62 ms, max = 496.578 ms, mean = 489.029 ms, median = 487.739 ms, percentile(99%) = 496.578 ms
[12/28/2021-08:00:34] [I] D2H Latency: min = 0.00292969 ms, max = 0.00537109 ms, mean = 0.00452881 ms, median = 0.0045166 ms, percentile(99%) = 0.00537109 ms
[12/28/2021-08:00:34] [I] Total Host Walltime: 4.89347 s
[12/28/2021-08:00:34] [I] Total GPU Compute Time: 4.89029 s
[12/28/2021-08:00:34] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/28/2021-08:00:34] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --loadEngine=tmp.engine
[12/28/2021-08:00:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 692, GPU 5988 (MiB)

Below are some improvements you can try:

  • Use FP16 mode instead (a conversion sketch is shown right after this list).
  • Apply pruning at training time.
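
As a rough sketch only, based on the tao-converter command above (same key and input dimensions; the output engine name is just an example), an FP16 engine can be generated and benchmarked like this:

$ ./tao-converter -k nvidia_tlt -d 3,540,960 frcnn_resnet_18.epoch74_fp32.etlt -t fp16 -e frcnn_fp16.engine
$ /usr/src/tensorrt/bin/trtexec --loadEngine=frcnn_fp16.engine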

Thanks.

Hi, I did apply pruning when training my model and also tried exporting in FP16 mode. However, my FP16 model still runs at only ~2 FPS, which is still pretty slow. I was wondering whether there is any other way to speed up the FPS.

I have another model, YOLOv4, trained on the same data, and the input size is relatively similar (544x960). Both the FP16 YOLOv4 and FRCNN models run with the same DeepStream code; the FPS for YOLOv4 is ~16, but for FRCNN it is ~2. May I know the reason why?

Hi,

That’s because FRCNN is a relatively complicated model.
The inference time depends on the depth and the layer types used in the model.

For pruning, below is the document for your reference (a rough sketch of the command follows the link):
https://docs.nvidia.com/tao/archive/tlt-20/tlt-user-guide/text/pruning_model.html
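
As a very rough sketch with the TLT 3.0 launcher (the model paths, spec file, key, and threshold below are placeholders; please follow the linked document for the exact arguments for faster_rcnn), pruning is run on the trained model and is then followed by retraining before export:

$ tlt faster_rcnn prune -m frcnn_resnet_18_unpruned.tlt \
                        -o frcnn_resnet_18_pruned.tlt \
                        -e frcnn_train_spec.txt \
                        -k nvidia_tlt \
                        -pth 0.4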

Thanks.
