Low FPS for pruned TAO Toolkit models on DeepStream

• Hardware Platform NVIDIA RTX A5000
• DeepStream Version 6.3
• TensorRT Version 8.5.1-1+cuda11.8
• NVIDIA GPU Driver Version (valid for GPU only) 535.183.01

We ran 30 1080p RTSP streams through a TAO YOLOv4 model (TAO Toolkit 3.22.05) with input size 1888x1056 in a DeepStream-based Python application built with reference to the DeepStream sample Python applications. The maximum FPS we were able to extract across the set of cameras was 130.
To improve the FPS, the model was pruned using TAO's prune command; the pruned model had a pruning ratio of 0.57 relative to the unpruned model. When run in the application with the same setup, it gave 150 FPS. This is not as significant an improvement as we had hoped. What could be the reason behind this, and can we expect a significant FPS improvement from pruning the model?

Have you measured the performance of the model with the trtexec tool?

No. Our application saves frames to disk, and the FPS was calculated from that.
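(For reference, a minimal sketch of measuring the pipeline FPS with a GStreamer buffer probe instead of counting frames saved to disk; the element and function names are illustrative, not the original application's code.)

import time
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

frame_count = 0
start_time = time.time()

def fps_probe(pad, info, u_data):
    """Count buffers passing the pad and print the FPS every 5 seconds."""
    global frame_count, start_time
    frame_count += 1
    elapsed = time.time() - start_time
    if elapsed >= 5.0:
        print(f"Pipeline FPS: {frame_count / elapsed:.1f}")
        frame_count = 0
        start_time = time.time()
    return Gst.PadProbeReturn.OK

# Attach to the sink pad of an element near the end of the pipeline, e.g. the OSD:
# osd.get_static_pad("sink").add_probe(Gst.PadProbeType.BUFFER, fps_probe, 0)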

Please test the model's performance with trtexec first; we need to know whether the model is the bottleneck or not.

Also, please provide the pipeline and the configurations you used to test the FPS; it is important that the correct parameters are set.


Above is the pipeline for the DeepStream part.
This is the unpruned model's nvinfer config:
pgie_d26_apr1924_apm_fframe_yolov4_resnet18_epoch_045_drop8.txt (4.4 KB)
This is the pruned model's nvinfer config:
pgie_d26_apr1924_yolov4_resnet18_epoch_045_pruned_5_e046.txt (4.0 KB)

I will share the profiling data for both models shortly.

You have added a videorate element to the pipeline, which controls the FPS of the pipeline. Please remove the videorate since you are using a local file. You only need to set "sync=false" on your sink element to make the pipeline run as fast as possible; see the sketch below.
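A minimal sketch of that change in a DeepStream Python app, assuming a fakesink is used as the sink (the element and pipeline names are illustrative):

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)
pipeline = Gst.Pipeline.new("ds-pipeline")

# No videorate element anywhere in the pipeline: it would pace buffers to a
# fixed frame rate and cap the FPS you can measure.
sink = Gst.ElementFactory.make("fakesink", "sink")
sink.set_property("sync", False)   # do not synchronise buffers to the clock; run as fast as possible
sink.set_property("qos", False)    # do not send QoS events that throttle upstream elements
pipeline.add(sink)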

The sources will be RTSP feeds, not a local file as shown. Please refer to the following diagram for the sources.

The following are the profiling data for the pruned and unpruned models, respectively.
profile_pruned.txt (20.2 KB)
profile_unpruned.txt (19.8 KB)

We don't need the profiling data. Please provide the "trtexec" inferencing log.

Can you be specific about what you mean by "inferencing log" among the trtexec command's reporting options?

Take PeopleNet | NVIDIA NGC as an example: download the deployable_quantized_onnx_v2.6.2 ONNX model and run the "trtexec" build and inferencing command
trtexec --onnx=./resnet34_peoplenet_int8.onnx --int8 --calib=./resnet34_peoplenet_int8.txt --saveEngine=./resnet34_peoplenet_int8.onnx_b1_gpu0_int8.engine --minShapes="input_1:0":1x3x544x960 --optShapes="input_1:0":1x3x544x960 --maxShapes="input_1:0":1x3x544x960

You will get an inferencing log like the following:

[08/01/2024-05:40:08] [I] TensorRT version: 8.6.1
[08/01/2024-05:40:08] [I] Loading standard plugins
[08/01/2024-05:40:08] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 19, GPU 3216 (MiB)
[08/01/2024-05:40:16] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1444, GPU +281, now: CPU 1540, GPU 3499 (MiB)
[08/01/2024-05:40:16] [I] Start parsing network model.
[08/01/2024-05:40:16] [I] [TRT] ----------------------------------------------------------------
[08/01/2024-05:40:16] [I] [TRT] Input filename:   ./resnet34_peoplenet_int8.onnx
[08/01/2024-05:40:16] [I] [TRT] ONNX IR version:  0.0.7
[08/01/2024-05:40:16] [I] [TRT] Opset version:    12
[08/01/2024-05:40:16] [I] [TRT] Producer name:    tf2onnx
[08/01/2024-05:40:16] [I] [TRT] Producer version: 1.9.2
[08/01/2024-05:40:16] [I] [TRT] Domain:           
[08/01/2024-05:40:16] [I] [TRT] Model version:    0
[08/01/2024-05:40:16] [I] [TRT] Doc string:       
[08/01/2024-05:40:16] [I] [TRT] ----------------------------------------------------------------
[08/01/2024-05:40:16] [I] Finished parsing network model. Parse time: 0.0328782
[08/01/2024-05:40:16] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best
[08/01/2024-05:40:16] [I] [TRT] Graph optimization time: 0.00729083 seconds.
[08/01/2024-05:40:16] [I] [TRT] Reading Calibration Cache for calibrator: EntropyCalibration2
[08/01/2024-05:40:16] [I] [TRT] Generated calibration scales using calibration cache. Make sure that calibration cache has latest scales.
[08/01/2024-05:40:16] [I] [TRT] To regenerate calibration cache, please delete the existing one. TensorRT will generate a new calibration cache.
[08/01/2024-05:40:17] [I] [TRT] Graph optimization time: 0.11371 seconds.
[08/01/2024-05:40:17] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[08/01/2024-05:41:11] [I] [TRT] Detected 1 inputs and 2 output network tensors.
[08/01/2024-05:41:11] [I] [TRT] Total Host Persistent Memory: 250448
[08/01/2024-05:41:11] [I] [TRT] Total Device Persistent Memory: 0
[08/01/2024-05:41:11] [I] [TRT] Total Scratch Memory: 0
[08/01/2024-05:41:11] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 10 MiB, GPU 32 MiB
[08/01/2024-05:41:11] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 51 steps to complete.
[08/01/2024-05:41:11] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.908891ms to assign 4 blocks to 51 nodes requiring 8551936 bytes.
[08/01/2024-05:41:11] [I] [TRT] Total Activation Memory: 8551936
[08/01/2024-05:41:11] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +2, GPU +4, now: CPU 2, GPU 4 (MiB)
[08/01/2024-05:41:11] [I] Engine built in 63.0727 sec.
[08/01/2024-05:41:11] [I] [TRT] Loaded engine size: 5 MiB
[08/01/2024-05:41:11] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +2, now: CPU 0, GPU 2 (MiB)
[08/01/2024-05:41:11] [I] Engine deserialized in 0.0335891 sec.
[08/01/2024-05:41:11] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +9, now: CPU 0, GPU 11 (MiB)
[08/01/2024-05:41:11] [I] Setting persistentCacheLimit to 0 bytes.
[08/01/2024-05:41:11] [I] Using random values for input input_1:0
[08/01/2024-05:41:11] [I] Input binding for input_1:0 with dimensions 1x3x544x960 is created.
[08/01/2024-05:41:11] [I] Output binding for output_cov/Sigmoid:0 with dimensions 1x3x34x60 is created.
[08/01/2024-05:41:11] [I] Output binding for output_bbox/BiasAdd:0 with dimensions 1x12x34x60 is created.
[08/01/2024-05:41:11] [I] Starting inference
[08/01/2024-05:41:14] [I] Warmup completed 277 queries over 200 ms
[08/01/2024-05:41:14] [I] Timing trace has 4401 queries over 3.00231 s
[08/01/2024-05:41:14] [I] 
[08/01/2024-05:41:14] [I] === Trace details ===
[08/01/2024-05:41:14] [I] Trace averages of 10 runs:
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.518755 ms - Host latency: 1.06429 ms (enqueue 0.381085 ms)
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.517635 ms - Host latency: 1.06313 ms (enqueue 0.38203 ms)
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.507956 ms - Host latency: 1.05537 ms (enqueue 0.427177 ms)
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.520602 ms - Host latency: 1.06481 ms (enqueue 0.40968 ms)
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.51886 ms - Host latency: 1.06459 ms (enqueue 0.414577 ms)
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.514253 ms - Host latency: 1.06079 ms (enqueue 0.431619 ms)
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.517529 ms - Host latency: 1.06322 ms (enqueue 0.410648 ms)
...

My models are .etlt files. Does "trtexec" support engine file creation for .etlt files?
I saw a forum post saying it doesn't.

After you run the DeepStream app, an engine file has already been generated. You can run "trtexec" with that engine file directly.
E.g.

trtexec --loadEngine=sample.engine

Here are the unpruned and pruned models' inference logs, respectively:
unpruned_infer.log (13.6 KB)
pruned_infer.log (15.5 KB)

The pruned model's performance may reach around 190 ~ 200 FPS.

Have you monitored the GPU load when running the Python app with the pruned model?

The command is “nvidia-smi dmon”.

Can you explain how you calculated that the model's performance can reach ~200 FPS?
Thanks

[08/01/2024-12:52:30] [I] === Performance summary ===
[08/01/2024-12:52:30] [I] Throughput: 225.012 qps
[08/01/2024-12:52:30] [I] Latency: min = 5.40283 ms, max = 5.73297 ms, mean = 5.4537 ms, median = 5.4563 ms, percentile(90%) = 5.46387 ms, percentile(95%) = 5.4823 ms, percentile(99%) = 5.49117 ms
[08/01/2024-12:52:30] [I] Enqueue Time: min = 0.217285 ms, max = 1.97723 ms, mean = 1.04071 ms, median = 1.0918 ms, percentile(90%) = 1.1228 ms, percentile(95%) = 1.53406 ms, percentile(99%) = 1.9436 ms
[08/01/2024-12:52:30] [I] H2D Latency: min = 1.00256 ms, max = 1.03442 ms, mean = 1.00881 ms, median = 1.00854 ms, percentile(90%) = 1.0105 ms, percentile(95%) = 1.01294 ms, percentile(99%) = 1.02002 ms
[08/01/2024-12:52:30] [I] GPU Compute Time: min = 4.38373 ms, max = 4.71448 ms, mean = 4.43574 ms, median = 4.43896 ms, percentile(90%) = 4.44519 ms, percentile(95%) = 4.46667 ms, percentile(99%) = 4.47385 ms
[08/01/2024-12:52:30] [I] D2H Latency: min = 0.00610352 ms, max = 0.0116882 ms, mean = 0.00914481 ms, median = 0.0090332 ms, percentile(90%) = 0.0101929 ms, percentile(95%) = 0.010376 ms, percentile(99%) = 0.0108032 ms
[08/01/2024-12:52:30] [I] Total Host Walltime: 3.01318 s
[08/01/2024-12:52:30] [I] Total GPU Compute Time: 3.00743 s

The GPU compute time plus the H2D/D2H latencies is roughly the time needed to process one frame.
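A rough back-of-the-envelope check using the mean values from the summary above (batch size 1; the degree of copy/compute overlap is my assumption, not stated in the thread):

# Mean values from the pruned model's trtexec summary (milliseconds)
h2d_ms, compute_ms, d2h_ms = 1.009, 4.436, 0.009

# If the copies and compute are fully serialised, one frame takes ~5.45 ms
frame_ms = h2d_ms + compute_ms + d2h_ms
fps_serial = 1000.0 / frame_ms        # ~183 FPS lower bound

# If the copies overlap with compute, throughput approaches the compute-bound limit
fps_overlap = 1000.0 / compute_ms     # ~225 FPS, matching the reported 225 qps

# A realistic single-stream estimate sits between the two, around 190-200 FPS
print(round(fps_serial), round(fps_overlap))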


Please monitor the GPU load when running the Python app with the pruned model.

How did you calculate the 130 value?