Low FPS for pruned TAO Toolkit models on DeepStream

• Hardware Platform NVIDIA RTX A5000
• DeepStream Version 6.3
• TensorRT Version 8.5.1-1+cuda11.8
• NVIDIA GPU Driver Version (valid for GPU only) 535.183.01

We ran 30 1080p RTSP streams through a TAO YOLOv4 model (TAO Toolkit 3.22.05) with input size 1888x1056 in a DeepStream-based Python application built with reference to the DeepStream sample Python applications. The maximum FPS we were able to extract across the set of cameras was 130.
To improve the FPS, the model was pruned using TAO's prune command; the pruned model had a pruning ratio of 0.57 relative to the unpruned model. When run in the application with the same setup, it gave 150 FPS. This is not as significant an improvement as we had hoped. What could be the reason behind this, and can we expect a significant FPS improvement from pruning the model?

Have you measured the performance of the model with the trtexec tool?

No. Our application saves frames to disk, and the FPS was calculated from that.
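(For reference, a minimal sketch of measuring the pipeline FPS with a GStreamer buffer probe instead of counting frames saved to disk; the element and function names are illustrative, not the original application's code.)

import time
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

frame_count = 0
start_time = time.time()

def fps_probe(pad, info, u_data):
    """Count buffers passing the pad and print the FPS every 5 seconds."""
    global frame_count, start_time
    frame_count += 1
    elapsed = time.time() - start_time
    if elapsed >= 5.0:
        print(f"Pipeline FPS: {frame_count / elapsed:.1f}")
        frame_count = 0
        start_time = time.time()
    return Gst.PadProbeReturn.OK

# Attach to the sink pad of an element near the end of the pipeline, e.g. the OSD:
# osd.get_static_pad("sink").add_probe(Gst.PadProbeType.BUFFER, fps_probe, 0)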

Please test the model's performance with trtexec first; we need to know whether the model is the bottleneck or not.

Also, please provide the pipeline and the configurations you used to test the FPS; it is important that the correct parameters are set.


Above is the pipeline for the DeepStream part.
This is the unpruned model's nvinfer config:
pgie_d26_apr1924_apm_fframe_yolov4_resnet18_epoch_045_drop8.txt (4.4 KB)
This is the pruned model's nvinfer config:
pgie_d26_apr1924_yolov4_resnet18_epoch_045_pruned_5_e046.txt (4.0 KB)

I will share the profiling data for both models shortly.

You have added a videorate element to the pipeline, which controls the FPS of the pipeline. Please remove the videorate since you are using a local file. You only need to set "sync=false" on your sink element to make the pipeline run as fast as possible; see the sketch below.
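A minimal sketch of that change in a DeepStream Python app, assuming a fakesink is used as the sink (the element and pipeline names are illustrative):

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)
pipeline = Gst.Pipeline.new("ds-pipeline")

# No videorate element anywhere in the pipeline: it would pace buffers to a
# fixed frame rate and cap the FPS you can measure.
sink = Gst.ElementFactory.make("fakesink", "sink")
sink.set_property("sync", False)   # do not synchronise buffers to the clock; run as fast as possible
sink.set_property("qos", False)    # do not send QoS events that throttle upstream elements
pipeline.add(sink)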

The sources will be RTSP feeds, not a local file as shown. Please refer to the following diagram for the sources.

The following are the profiling data for the pruned and unpruned models, respectively.
profile_pruned.txt (20.2 KB)
profile_unpruned.txt (19.8 KB)

We don't need the profiling data. Please provide the "trtexec" inferencing log.

Can you be specific about what you mean by "inferencing log" among the trtexec command's reporting options?

Take PeopleNet | NVIDIA NGC as an example: download the deployable_quantized_onnx_v2.6.2 ONNX model and run the "trtexec" build and inferencing command
trtexec --onnx=./resnet34_peoplenet_int8.onnx --int8 --calib=./resnet34_peoplenet_int8.txt --saveEngine=./resnet34_peoplenet_int8.onnx_b1_gpu0_int8.engine --minShapes="input_1:0":1x3x544x960 --optShapes="input_1:0":1x3x544x960 --maxShapes="input_1:0":1x3x544x960

You will get an inferencing log like the following:

[08/01/2024-05:40:08] [I] TensorRT version: 8.6.1
[08/01/2024-05:40:08] [I] Loading standard plugins
[08/01/2024-05:40:08] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 19, GPU 3216 (MiB)
[08/01/2024-05:40:16] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1444, GPU +281, now: CPU 1540, GPU 3499 (MiB)
[08/01/2024-05:40:16] [I] Start parsing network model.
[08/01/2024-05:40:16] [I] [TRT] ----------------------------------------------------------------
[08/01/2024-05:40:16] [I] [TRT] Input filename:   ./resnet34_peoplenet_int8.onnx
[08/01/2024-05:40:16] [I] [TRT] ONNX IR version:  0.0.7
[08/01/2024-05:40:16] [I] [TRT] Opset version:    12
[08/01/2024-05:40:16] [I] [TRT] Producer name:    tf2onnx
[08/01/2024-05:40:16] [I] [TRT] Producer version: 1.9.2
[08/01/2024-05:40:16] [I] [TRT] Domain:           
[08/01/2024-05:40:16] [I] [TRT] Model version:    0
[08/01/2024-05:40:16] [I] [TRT] Doc string:       
[08/01/2024-05:40:16] [I] [TRT] ----------------------------------------------------------------
[08/01/2024-05:40:16] [I] Finished parsing network model. Parse time: 0.0328782
[08/01/2024-05:40:16] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best
[08/01/2024-05:40:16] [I] [TRT] Graph optimization time: 0.00729083 seconds.
[08/01/2024-05:40:16] [I] [TRT] Reading Calibration Cache for calibrator: EntropyCalibration2
[08/01/2024-05:40:16] [I] [TRT] Generated calibration scales using calibration cache. Make sure that calibration cache has latest scales.
[08/01/2024-05:40:16] [I] [TRT] To regenerate calibration cache, please delete the existing one. TensorRT will generate a new calibration cache.
[08/01/2024-05:40:17] [I] [TRT] Graph optimization time: 0.11371 seconds.
[08/01/2024-05:40:17] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[08/01/2024-05:41:11] [I] [TRT] Detected 1 inputs and 2 output network tensors.
[08/01/2024-05:41:11] [I] [TRT] Total Host Persistent Memory: 250448
[08/01/2024-05:41:11] [I] [TRT] Total Device Persistent Memory: 0
[08/01/2024-05:41:11] [I] [TRT] Total Scratch Memory: 0
[08/01/2024-05:41:11] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 10 MiB, GPU 32 MiB
[08/01/2024-05:41:11] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 51 steps to complete.
[08/01/2024-05:41:11] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.908891ms to assign 4 blocks to 51 nodes requiring 8551936 bytes.
[08/01/2024-05:41:11] [I] [TRT] Total Activation Memory: 8551936
[08/01/2024-05:41:11] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +2, GPU +4, now: CPU 2, GPU 4 (MiB)
[08/01/2024-05:41:11] [I] Engine built in 63.0727 sec.
[08/01/2024-05:41:11] [I] [TRT] Loaded engine size: 5 MiB
[08/01/2024-05:41:11] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +2, now: CPU 0, GPU 2 (MiB)
[08/01/2024-05:41:11] [I] Engine deserialized in 0.0335891 sec.
[08/01/2024-05:41:11] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +9, now: CPU 0, GPU 11 (MiB)
[08/01/2024-05:41:11] [I] Setting persistentCacheLimit to 0 bytes.
[08/01/2024-05:41:11] [I] Using random values for input input_1:0
[08/01/2024-05:41:11] [I] Input binding for input_1:0 with dimensions 1x3x544x960 is created.
[08/01/2024-05:41:11] [I] Output binding for output_cov/Sigmoid:0 with dimensions 1x3x34x60 is created.
[08/01/2024-05:41:11] [I] Output binding for output_bbox/BiasAdd:0 with dimensions 1x12x34x60 is created.
[08/01/2024-05:41:11] [I] Starting inference
[08/01/2024-05:41:14] [I] Warmup completed 277 queries over 200 ms
[08/01/2024-05:41:14] [I] Timing trace has 4401 queries over 3.00231 s
[08/01/2024-05:41:14] [I] 
[08/01/2024-05:41:14] [I] === Trace details ===
[08/01/2024-05:41:14] [I] Trace averages of 10 runs:
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.518755 ms - Host latency: 1.06429 ms (enqueue 0.381085 ms)
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.517635 ms - Host latency: 1.06313 ms (enqueue 0.38203 ms)
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.507956 ms - Host latency: 1.05537 ms (enqueue 0.427177 ms)
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.520602 ms - Host latency: 1.06481 ms (enqueue 0.40968 ms)
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.51886 ms - Host latency: 1.06459 ms (enqueue 0.414577 ms)
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.514253 ms - Host latency: 1.06079 ms (enqueue 0.431619 ms)
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.517529 ms - Host latency: 1.06322 ms (enqueue 0.410648 ms)
...

My models are .etlt files. Does "trtexec" support engine file creation for .etlt files?
I saw a forum post saying it doesn't.

After you run the DeepStream app, an engine file has already been generated. You can run "trtexec" with that engine file directly.
E.g.

trtexec --loadEngine=sample.engine

Here are the unpruned and pruned models' inference logs, respectively:
unpruned_infer.log (13.6 KB)
pruned_infer.log (15.5 KB)

The pruned model's performance may reach around 190 ~ 200 FPS.

Have you monitored the GPU load when running the Python app with the pruned model?

The command is “nvidia-smi dmon”.

Can you explain how you calculated that the model's performance can reach ~200 FPS?
Thanks

[08/01/2024-12:52:30] [I] === Performance summary ===
[08/01/2024-12:52:30] [I] Throughput: 225.012 qps
[08/01/2024-12:52:30] [I] Latency: min = 5.40283 ms, max = 5.73297 ms, mean = 5.4537 ms, median = 5.4563 ms, percentile(90%) = 5.46387 ms, percentile(95%) = 5.4823 ms, percentile(99%) = 5.49117 ms
[08/01/2024-12:52:30] [I] Enqueue Time: min = 0.217285 ms, max = 1.97723 ms, mean = 1.04071 ms, median = 1.0918 ms, percentile(90%) = 1.1228 ms, percentile(95%) = 1.53406 ms, percentile(99%) = 1.9436 ms
[08/01/2024-12:52:30] [I] H2D Latency: min = 1.00256 ms, max = 1.03442 ms, mean = 1.00881 ms, median = 1.00854 ms, percentile(90%) = 1.0105 ms, percentile(95%) = 1.01294 ms, percentile(99%) = 1.02002 ms
[08/01/2024-12:52:30] [I] GPU Compute Time: min = 4.38373 ms, max = 4.71448 ms, mean = 4.43574 ms, median = 4.43896 ms, percentile(90%) = 4.44519 ms, percentile(95%) = 4.46667 ms, percentile(99%) = 4.47385 ms
[08/01/2024-12:52:30] [I] D2H Latency: min = 0.00610352 ms, max = 0.0116882 ms, mean = 0.00914481 ms, median = 0.0090332 ms, percentile(90%) = 0.0101929 ms, percentile(95%) = 0.010376 ms, percentile(99%) = 0.0108032 ms
[08/01/2024-12:52:30] [I] Total Host Walltime: 3.01318 s
[08/01/2024-12:52:30] [I] Total GPU Compute Time: 3.00743 s

The GPU compute time plus the H2D/D2H latencies is roughly the time needed to process one frame.
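A rough back-of-the-envelope check using the mean values from the summary above (batch size 1; the degree of copy/compute overlap is my assumption, not stated in the thread):

# Mean values from the pruned model's trtexec summary (milliseconds)
h2d_ms, compute_ms, d2h_ms = 1.009, 4.436, 0.009

# If the copies and compute are fully serialised, one frame takes ~5.45 ms
frame_ms = h2d_ms + compute_ms + d2h_ms
fps_serial = 1000.0 / frame_ms        # ~183 FPS lower bound

# If the copies overlap with compute, throughput approaches the compute-bound limit
fps_overlap = 1000.0 / compute_ms     # ~225 FPS, matching the reported 225 qps

# A realistic single-stream estimate sits between the two, around 190-200 FPS
print(round(fps_serial), round(fps_overlap))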


Please monitor the GPU load when running the Python app with the pruned model.

How did you calculate the 130 value?