• Hardware Platform: NVIDIA RTX A5000
• DeepStream Version: 6.3
• TensorRT Version: 8.5.1-1+cuda11.8
• NVIDIA GPU Driver Version: 535.183.01
We ran 30 1080p RTSP streams through a TAO YOLOv4 model (TAO Toolkit 3.22.05) with input size 1888×1056 on a DeepStream-based Python application, built with reference to the DeepStream Python sample applications. The maximum FPS we were able to extract for the set of cameras was 130.
To improve the FPS, the model was pruned with TAO's prune command; the pruned model came out at a size ratio of 0.57 relative to the unpruned model. Running this pruned model in the same application and setup gave 150 FPS, which is not as significant an improvement as we had hoped. What could be the reason behind this, and can we expect a significantly higher FPS from pruning the model?
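A rough way to see why a 0.57 prune ratio rarely translates into a ~1.75x FPS gain: inference is only one stage of the per-frame pipeline time (decode, preprocessing, postprocessing, and the sink all contribute), so shrinking the model only shrinks that one term. A minimal sketch with assumed, illustrative numbers (the 4.0 ms / 3.7 ms split is hypothetical, chosen only to match the observed 130 FPS):

```python
# Hypothetical per-frame time split (assumptions, not measurements):
# the total budget at 130 FPS is 1000/130 ≈ 7.7 ms per frame.
infer_ms = 4.0   # assumed inference share of the frame time
other_ms = 3.7   # assumed decode/preprocess/postprocess/sink share

baseline_fps = 1000 / (infer_ms + other_ms)       # ≈ 130

# Best case: pruning scales inference cost by the prune ratio 0.57,
# but leaves every other stage untouched.
pruned_fps = 1000 / (infer_ms * 0.57 + other_ms)  # ≈ 167, not 130/0.57 ≈ 228

print(f"baseline {baseline_fps:.0f} FPS -> pruned {pruned_fps:.0f} FPS")
```

In practice the gain can be even smaller, since a reduced channel count does not cut runtime linearly and the GPU may not be the bottleneck at all with 30 concurrent RTSP streams.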
You have added videorate to the pipeline, which caps the FPS of the pipeline. Please remove videorate since you are using a local file. You just need to set "sync=false" on your sink element to make the pipeline run as fast as possible.
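For example, a minimal pipeline sketch (the file name and elements here are placeholders, not taken from the original application): with no videorate element and sync=false on the sink, buffers are rendered as soon as they arrive instead of being paced against the pipeline clock.

```shell
# Placeholder pipeline: no videorate, and sync=false on the sink.
gst-launch-1.0 filesrc location=sample.mp4 ! decodebin ! \
    fakesink sync=false
```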
Take the PeopleNet model on NVIDIA NGC as an example: download the deployable_quantized_onnx_v2.6.2 ONNX model and run the following "trtexec" build-and-inference command:

trtexec --onnx=./resnet34_peoplenet_int8.onnx --int8 --calib=./resnet34_peoplenet_int8.txt --saveEngine=./resnet34_peoplenet_int8.onnx_b1_gpu0_int8.engine --minShapes="input_1:0":1x3x544x960 --optShapes="input_1:0":1x3x544x960 --maxShapes="input_1:0":1x3x544x960
We will get an inference log like:
[08/01/2024-05:40:08] [I] TensorRT version: 8.6.1
[08/01/2024-05:40:08] [I] Loading standard plugins
[08/01/2024-05:40:08] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 19, GPU 3216 (MiB)
[08/01/2024-05:40:16] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1444, GPU +281, now: CPU 1540, GPU 3499 (MiB)
[08/01/2024-05:40:16] [I] Start parsing network model.
[08/01/2024-05:40:16] [I] [TRT] ----------------------------------------------------------------
[08/01/2024-05:40:16] [I] [TRT] Input filename: ./resnet34_peoplenet_int8.onnx
[08/01/2024-05:40:16] [I] [TRT] ONNX IR version: 0.0.7
[08/01/2024-05:40:16] [I] [TRT] Opset version: 12
[08/01/2024-05:40:16] [I] [TRT] Producer name: tf2onnx
[08/01/2024-05:40:16] [I] [TRT] Producer version: 1.9.2
[08/01/2024-05:40:16] [I] [TRT] Domain:
[08/01/2024-05:40:16] [I] [TRT] Model version: 0
[08/01/2024-05:40:16] [I] [TRT] Doc string:
[08/01/2024-05:40:16] [I] [TRT] ----------------------------------------------------------------
[08/01/2024-05:40:16] [I] Finished parsing network model. Parse time: 0.0328782
[08/01/2024-05:40:16] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best
[08/01/2024-05:40:16] [I] [TRT] Graph optimization time: 0.00729083 seconds.
[08/01/2024-05:40:16] [I] [TRT] Reading Calibration Cache for calibrator: EntropyCalibration2
[08/01/2024-05:40:16] [I] [TRT] Generated calibration scales using calibration cache. Make sure that calibration cache has latest scales.
[08/01/2024-05:40:16] [I] [TRT] To regenerate calibration cache, please delete the existing one. TensorRT will generate a new calibration cache.
[08/01/2024-05:40:17] [I] [TRT] Graph optimization time: 0.11371 seconds.
[08/01/2024-05:40:17] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[08/01/2024-05:41:11] [I] [TRT] Detected 1 inputs and 2 output network tensors.
[08/01/2024-05:41:11] [I] [TRT] Total Host Persistent Memory: 250448
[08/01/2024-05:41:11] [I] [TRT] Total Device Persistent Memory: 0
[08/01/2024-05:41:11] [I] [TRT] Total Scratch Memory: 0
[08/01/2024-05:41:11] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 10 MiB, GPU 32 MiB
[08/01/2024-05:41:11] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 51 steps to complete.
[08/01/2024-05:41:11] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.908891ms to assign 4 blocks to 51 nodes requiring 8551936 bytes.
[08/01/2024-05:41:11] [I] [TRT] Total Activation Memory: 8551936
[08/01/2024-05:41:11] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +2, GPU +4, now: CPU 2, GPU 4 (MiB)
[08/01/2024-05:41:11] [I] Engine built in 63.0727 sec.
[08/01/2024-05:41:11] [I] [TRT] Loaded engine size: 5 MiB
[08/01/2024-05:41:11] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +2, now: CPU 0, GPU 2 (MiB)
[08/01/2024-05:41:11] [I] Engine deserialized in 0.0335891 sec.
[08/01/2024-05:41:11] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +9, now: CPU 0, GPU 11 (MiB)
[08/01/2024-05:41:11] [I] Setting persistentCacheLimit to 0 bytes.
[08/01/2024-05:41:11] [I] Using random values for input input_1:0
[08/01/2024-05:41:11] [I] Input binding for input_1:0 with dimensions 1x3x544x960 is created.
[08/01/2024-05:41:11] [I] Output binding for output_cov/Sigmoid:0 with dimensions 1x3x34x60 is created.
[08/01/2024-05:41:11] [I] Output binding for output_bbox/BiasAdd:0 with dimensions 1x12x34x60 is created.
[08/01/2024-05:41:11] [I] Starting inference
[08/01/2024-05:41:14] [I] Warmup completed 277 queries over 200 ms
[08/01/2024-05:41:14] [I] Timing trace has 4401 queries over 3.00231 s
[08/01/2024-05:41:14] [I]
[08/01/2024-05:41:14] [I] === Trace details ===
[08/01/2024-05:41:14] [I] Trace averages of 10 runs:
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.518755 ms - Host latency: 1.06429 ms (enqueue 0.381085 ms)
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.517635 ms - Host latency: 1.06313 ms (enqueue 0.38203 ms)
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.507956 ms - Host latency: 1.05537 ms (enqueue 0.427177 ms)
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.520602 ms - Host latency: 1.06481 ms (enqueue 0.40968 ms)
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.51886 ms - Host latency: 1.06459 ms (enqueue 0.414577 ms)
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.514253 ms - Host latency: 1.06079 ms (enqueue 0.431619 ms)
[08/01/2024-05:41:14] [I] Average on 10 runs - GPU latency: 0.517529 ms - Host latency: 1.06322 ms (enqueue 0.410648 ms)
...
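The timing trace above gives the throughput directly: 4401 queries over 3.00231 s. A quick sanity check on those numbers, including the upper bound implied by GPU compute latency alone (0.5188 ms is one of the 10-run averages from the trace):

```python
# Throughput implied by the trtexec timing trace above.
queries = 4401
duration_s = 3.00231
qps = queries / duration_s  # measured end-to-end throughput

gpu_latency_ms = 0.5188     # one 10-run average of GPU compute latency
gpu_bound_qps = 1000 / gpu_latency_ms  # ceiling if GPU compute were the only cost

print(f"throughput: {qps:.0f} qps")
print(f"GPU-compute-only ceiling: {gpu_bound_qps:.0f} qps")
```

The gap between the two shows how much host-side overhead (the ~1.06 ms host latency in the trace, versus ~0.52 ms on the GPU) costs at batch size 1; trtexec hides part of it by enqueuing asynchronously.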