Increase the FPS

• Hardware Platform (Jetson / GPU) NVIDIA A2
• DeepStream Version 6.3
• TensorRT Version 8.4.0
• NVIDIA GPU Driver Version (valid for GPU only) 535.129.03

Hello,

I am running some experiments with deepstream_test3.py. When I use only 1 video, the FPS is around 85~88 and the GPU memory usage is 2602MiB / 15356MiB.

Screenshot 2024-03-14 113241

When I use 30 videos, the FPS goes down to 3.4~3.8 and the GPU memory usage is 5284MiB / 15356MiB.

I expected that when running 30 videos the FPS would remain the same and the GPU would simply do more work.

  • What causes this drop in FPS?
  • How can I increase the performance to improve the FPS?

I appreciate your help.

Please refer to this topic for performance improvement.
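As a starting point, here is a minimal sketch of the throughput knobs usually checked first in deepstream_test3.py; the property names are the standard nvstreammux/nvinfer ones, but the values are assumptions you would tune for your setup:

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)
number_sources = 30

# nvstreammux: batch across all sources, and cap the wait for a full batch.
streammux = Gst.ElementFactory.make("nvstreammux", "stream-muxer")
streammux.set_property("batch-size", number_sources)
streammux.set_property("batched-push-timeout", 33000)  # microseconds

# nvinfer only: match the engine batch size to the source count; "interval"
# skips batches to trade accuracy for throughput (0 = infer every batch).
# With nvinferserver, set max_batch_size in its config file instead.
pgie = Gst.ElementFactory.make("nvinfer", "primary-inference")
pgie.set_property("batch-size", number_sources)
pgie.set_property("interval", 0)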

Hello @fanzh

I have been trying these suggestions, but unfortunately I am getting the same FPS no matter what I change.

I feel like there’s another bottleneck causing this issue. Do you have any idea?

  1. What is your start command-line? What are the resolution and fps of the file?
  2. Did you modify the deepstream_test3.py code? If yes, please share the diff.
  3. Could you share the result of "deepstream-app -c source30_1080p_dec_infer-resnet_tiled_display_int8.yml"? It will test 30-stream inference. The path is /opt/nvidia/deepstream/deepstream/samples/configs/deepstream-app.

@fanzh

  1. I run the pipeline using a python3 command in a Docker container. Here is the ENTRYPOINT from my Dockerfile:
ENTRYPOINT [ "python3", "run.py", "--stream_paths", "/opt/nvidia/deepstream/deepstream-6.4/sources/inference/configs/streams/streams.json",  "--pgie", "nvinferserver", "--config", "/opt/nvidia/deepstream/deepstream-6.4/samples/triton_model_repo/peoplenet/config_triton_infer_primary_peoplenet.txt"]

Resolution and FPS are: 1920 x1080 and 60 fps respectively.

  2. Yes, I modified the deepstream_test3.py code. Since it’s completely different, I’ll share the code in a DM.

  3. When I ran that command (deepstream-app -c source30_1080p_dec_infer-resnet_tiled_display_int8.yml), I got this:

** ERROR: <main:733>: Could not open X Display
Quitting
nvstreammux: Successfully handled EOS for source_id=0
nvstreammux: Successfully handled EOS for source_id=1
nvstreammux: Successfully handled EOS for source_id=2
nvstreammux: Successfully handled EOS for source_id=3
nvstreammux: Successfully handled EOS for source_id=4
nvstreammux: Successfully handled EOS for source_id=5
nvstreammux: Successfully handled EOS for source_id=6
nvstreammux: Successfully handled EOS for source_id=7
nvstreammux: Successfully handled EOS for source_id=8
nvstreammux: Successfully handled EOS for source_id=9
nvstreammux: Successfully handled EOS for source_id=10
nvstreammux: Successfully handled EOS for source_id=11
nvstreammux: Successfully handled EOS for source_id=12
nvstreammux: Successfully handled EOS for source_id=13
nvstreammux: Successfully handled EOS for source_id=14
nvstreammux: Successfully handled EOS for source_id=15
nvstreammux: Successfully handled EOS for source_id=16
nvstreammux: Successfully handled EOS for source_id=17
nvstreammux: Successfully handled EOS for source_id=18
nvstreammux: Successfully handled EOS for source_id=19
nvstreammux: Successfully handled EOS for source_id=20
nvstreammux: Successfully handled EOS for source_id=21
nvstreammux: Successfully handled EOS for source_id=22
nvstreammux: Successfully handled EOS for source_id=23
nvstreammux: Successfully handled EOS for source_id=24
nvstreammux: Successfully handled EOS for source_id=25
nvstreammux: Successfully handled EOS for source_id=26
nvstreammux: Successfully handled EOS for source_id=27
nvstreammux: Successfully handled EOS for source_id=28
nvstreammux: Successfully handled EOS for source_id=29
App run failed
  1. What is the encode format? If H264, A10 can’t support 30 streams at 1080p, 60fps; please refer to this link. The decoding performance of A2 is similar to A10’s.
  2. Please set type to 1 in source30_1080p_dec_infer-resnet_tiled_display_int8.yml, which means using fakesink (see the sketch below), then share the result of "deepstream-app -c source30_1080p_dec_infer-resnet_tiled_display_int8.yml".
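For reference, a sketch of the relevant sink section of that YAML; check your local copy for the exact keys, as this is written from memory:

sink0:
  enable: 1
  # type 1 = fakesink (no rendering, no X display needed);
  # type 2 is the EGL/X11 sink that produces "Could not open X Display"
  type: 1
  sync: 0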

Yes, H264.
What do you mean by “A10 can’t support 30 streams with 1080p, 60fps”?

You can find the result attached here:
1.txt (19.5 MB)

In the link mentioned in my last comment, you can see that A10 can support 37 streams at 1080p, 30fps. A2 and A10 have similar decoding capability, so A2 may not be able to support 30 streams at 1080p, 60fps. To confirm this, you can set the source to your test file in source30_1080p_dec_infer-resnet_tiled_display_int8.yml, then check the test fps.
In the log 1.txt, the fps of each stream stays at 29~30; that is, A2 can support the decoding and inference of 30 streams at 1080p, 30fps.

I have no problem with 30 fps. But when I run my pipeline with 30 streams and two SGIEs in it, my fps is around 2.0 (with Kafka turned off).

When I remove the SGIEs, I get 19~20 fps. I never reached 30 fps.

  1. What do you mean by “I have no problem with 30 fps”? Do you mean that when testing 30 streams at 1080p, 60fps with source30_1080p_dec_infer-resnet_tiled_display_int8.yml, each stream can reach 30fps?
  2. As I mentioned above, A2 may not be able to support 30 streams at 1080p, 60fps. Can you try 10 or 20 streams first?
  3. About the 2fps issue, please refer to this topic for “Enable Latency measurement for deepstream sample apps” (a sketch of the Python side follows below); then you can see which plugin is consuming too much time.
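For reference, a sketch of how latency measurement is typically enabled from Python; the environment variables are the documented switches, and the probe call follows the pattern in newer deepstream_test3.py versions (treat the exact names as assumptions for your bindings version):

import os
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst
import pyds

# Both variables must be set before the pipeline starts.
os.environ["NVDS_ENABLE_LATENCY_MEASUREMENT"] = "1"
os.environ["NVDS_ENABLE_COMPONENT_LATENCY_MEASUREMENT"] = "1"

def latency_probe(pad, info, u_data):
    gst_buffer = info.get_buffer()
    if gst_buffer:
        # Prints per-plugin latency for every frame in this batch.
        if pyds.nvds_measure_buffer_latency(hash(gst_buffer)) == 0:
            print("Unable to measure latency for this buffer")
    return Gst.PadProbeReturn.OK

# Attach near the end of the pipeline, e.g. on the tiler's sink pad:
# tiler.get_static_pad("sink").add_probe(Gst.PadProbeType.BUFFER, latency_probe, 0)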

1- I mean that I would like to reach 30 fps in my pipeline.
2- With 6 streams at 1080p, 60fps, the pipeline runs at 10~12 fps.
3- Yes, I did that. The weird thing is that sometimes the bottleneck is “nvstreammux-Stream-muxer”, sometimes it is “primary-inference”, and other times it is “secondary-inference”, all in the same run :')

@fanzh
I want to share with you the results of my experiment with deepstream_test3.py.

GPU: A2
inference type: nvinferserver
Detector: Peoplenet of batch size 64
Number of streams: 18 (1080p, 60fps)
GPU and decoder utilization:

FPS:

Config of pgie:
config_triton_infer_primary_peoplenet.txt (1.2 KB)

  1. Currently I am unable to reproduce this low fps issue with deepstream_test3.py. Please refer to my test details:
    GPU: RTX 6000
    DeepStream version: nvcr.io/nvidia/deepstream:6.4-triton-multiarch
    Inference type: nvinferserver
    Detector: Peoplenet with batch size 30
    Number of streams: 20 (1080p, 60fps)
    Steps:
    1> Download the peoplenet model and generate a TRT engine for Triton with this command-line:
trtexec --onnx=./triton_model_repo/peoplenet/resnet34_peoplenet_int8.onnx --int8 --calib=./models/triton_model_repo/resnet34_peoplenet_int8.txt --saveEngine=./triton_model_repo/peoplenet/1/resnet34_peoplenet_int8.onnx_b1_gpu0_int8.engine --minShapes="input_1:0":1x3x544x960 --optShapes="input_1:0":30x3x544x960 --maxShapes="input_1:0":30x3x544x960

2> Set only max_batch_size: 30 in config.pbtxt, and only max_batch_size: 30 in config_triton_infer_primary_peoplenet.txt (see the snippet after this list).
3> Start the test. Here is the whole log: log0402.txt (16.0 KB). The fps is about 129. You can try the same start command-line.
  2. Did you modify deepstream_test3.py? If yes, please share the diff.
  3. If the fps is still low, please share the results for 3 streams (1080p, 60fps).
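For clarity, the batch-size change in step 2> is a one-line edit in each file; a sketch showing only the relevant fields (other fields in both files stay as shipped):

# triton_model_repo/peoplenet/config.pbtxt (Triton model configuration)
max_batch_size: 30

# config_triton_infer_primary_peoplenet.txt (gst-nvinferserver configuration)
infer_config {
  max_batch_size: 30
}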

Thank you @fanzh for your response.
I ran the experiment you described, and here are the results:

Model: PeopleNet, converted and quantized, batch size 30

trtexec --onnx=./resnet34_peoplenet_int8.onnx --int8 --calib=./resnet34_peoplenet_int8.txt --saveEngine=./1/quantized_peoplenet.engine --minShapes="input_1:0":1x3x544x960 --optShapes="input_1:0":30x3x544x960 --maxShapes="input_1:0":30x3x544x960

config files:

config.txt.txt (1.4 KB)
config_triton_infer_primary_peoplenet.txt (1.2 KB)

FIRST EXPERIMENT

Pipeline: deepstream_test_3 (with no changes)
Number of streams: 20
Dataset: Wildtrack
Resolution and fps: 1080p, 60fps
Infer type: nvinferserver
Log file:
20.log (17.0 KB)

SECOND EXPERIMENT

Pipeline: deepstream_test_3 (with no changes)
Number of streams: 3
Dataset: Wildtrack
Resolution and fps: 1080p, 60fps
Infer type: nvinferserver
Log file:
3.log (4.1 KB)

I don’t really understand why I get noticeably lower fps per stream with 20 streams than with 3 streams.
We are building a solution that should handle 30~60 streams on an A2. Is that feasible?

Thank you!

The total fps is equal to the number of streams times the fps of each stream.
In the latest test, the fps of each of the 20 streams is 26, which is better than the test on Apr 1 (7fps).

Yes, please refer to my last test. If using 60 streams, please change batch 30 to 60. And please use the following method to check whether there is a decoding or inference bottleneck.

  1. Remove the nvinferserver plugin to test decoding only. When the decoder utilization approaches 100%, that fps is the maximum decoding fps.
  2. You can use “trtexec --loadEngine=saved.engine --fp16” to get the theoretical maximum inference fps (see the sketch below). The overall fps will be less than both the max decoding fps and the max inference fps.
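To make the arithmetic concrete, a small sketch that turns the two measurements into a stream budget; the assumption is that trtexec’s “Throughput: N qps” line counts engine executions, i.e. one query = one batch:

# Hedged sketch: estimating how many 60fps streams the pipeline can sustain.
def supported_streams(trtexec_qps: float, batch_size: int,
                      max_decode_fps: float, stream_fps: float = 60.0) -> int:
    max_infer_fps = trtexec_qps * batch_size        # frames/sec the engine can infer
    total_fps = min(max_infer_fps, max_decode_fps)  # the slower stage wins
    return int(total_fps // stream_fps)

# Example with made-up numbers: 10 qps at batch 30 with 1000 fps of decode
# headroom gives min(300, 1000) // 60 = 5 streams at 60fps.
print(supported_streams(10.0, 30, 1000.0))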

Thank you for your replies @fanzh

Can you elaborate?

Yes, I think that’s because I was not using the calibration file while converting.

I unlinked the pgie (nvinferserver) from the deepstream_test_3 pipeline. The decoder utilization was 97~99%.

Screenshot 2024-04-05 030957

As soon as I linked the pgie back in, the utilization dropped to 16~30%.

Screenshot 2024-04-05 031037

I don’t really get this point. What do you mean by it?

In the test in “20.log”, the fps of each stream is 25, so the total fps is 25x20=500; in “3.log”, the fps of each stream is 126, so the total fps is 126x3=378.

From the test, the decoder utilization can’t reach 100% because inference can’t process frames fast enough; inference is the bottleneck. Can you share the two results of “trtexec --loadEngine=saved.engine --fp16”? One “saved.engine” generated with batch 30, the other with batch 60. From the results, you will get the theoretical inference performance. Please refer to this topic.

Do you mean running that command on the engine I already generated with the following command?

trtexec --onnx=./resnet34_peoplenet_int8.onnx --int8 --calib=./resnet34_peoplenet_int8.txt --saveEngine=./1/quantized_peoplenet.engine --minShapes="input_1:0":1x3x544x960 --optShapes="input_1:0":30x3x544x960 --maxShapes="input_1:0":30x3x544x960

Yes. Please also test batch 60 (a sketch of the command follows below); I wonder if the performance will improve as the batch size increases.
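For example, your earlier command with the opt/max shapes bumped to 60 (a sketch; the engine file name is changed here only so it does not overwrite the batch-30 engine):

trtexec --onnx=./resnet34_peoplenet_int8.onnx --int8 --calib=./resnet34_peoplenet_int8.txt --saveEngine=./1/quantized_peoplenet_b60.engine --minShapes="input_1:0":1x3x544x960 --optShapes="input_1:0":60x3x544x960 --maxShapes="input_1:0":60x3x544x960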

Sorry for the late response.

So, you want me to convert & test the model with batch size 60?