Increase the FPS

• Hardware Platform (Jetson / GPU) NVIDIA A2
• DeepStream Version 6.3
• TensorRT Version 8.4.0
• NVIDIA GPU Driver Version (valid for GPU only) 535.129.03

Hello,

I am running some experiments with deepstream_test3.py. When I use only 1 video, the FPS is around 85~88 and the GPU memory usage is 2602MiB / 15356MiB.

Screenshot 2024-03-14 113241

When I use 30 videos, the FPS goes down to 3.4~3.8 and the GPU memory usage is 5284MiB / 15356MiB.

I expected that when running 30 videos the FPS would remain the same and the GPU would simply do more work.

  • What causes this drop in FPS?
  • How can I increase the performance to improve the FPS?

I appreciate your help.

Please refer to this topic for performance improvement.
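As a starting point, here is a minimal sketch of the throughput knobs usually checked first in deepstream_test3.py; the property names are the standard nvstreammux/nvinfer ones, but the values are assumptions you would tune for your setup:

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)
number_sources = 30

# nvstreammux: batch across all sources, and cap the wait for a full batch.
streammux = Gst.ElementFactory.make("nvstreammux", "stream-muxer")
streammux.set_property("batch-size", number_sources)
streammux.set_property("batched-push-timeout", 33000)  # microseconds

# nvinfer only: match the engine batch size to the source count; "interval"
# skips batches to trade accuracy for throughput (0 = infer every batch).
# With nvinferserver, set max_batch_size in its config file instead.
pgie = Gst.ElementFactory.make("nvinfer", "primary-inference")
pgie.set_property("batch-size", number_sources)
pgie.set_property("interval", 0)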

Hello @fanzh

I have been trying these suggestions, but unfortunately I am getting the same FPS no matter what I change.

I feel like there’s another bottleneck causing this issue. Do you have any idea?

  1. What is your start command-line? What are the resolution and fps of the file?
  2. Did you modify the deepstream_test3.py code? If yes, please share the diff.
  3. Could you share the result of "deepstream-app -c source30_1080p_dec_infer-resnet_tiled_display_int8.yml"? It will test 30-stream inference. The path is /opt/nvidia/deepstream/deepstream/samples/configs/deepstream-app.

@fanzh

  1. I run the pipeline using a python3 command in a Docker container. Here is the ENTRYPOINT from my Dockerfile:
ENTRYPOINT [ "python3", "run.py", "--stream_paths", "/opt/nvidia/deepstream/deepstream-6.4/sources/inference/configs/streams/streams.json",  "--pgie", "nvinferserver", "--config", "/opt/nvidia/deepstream/deepstream-6.4/samples/triton_model_repo/peoplenet/config_triton_infer_primary_peoplenet.txt"]

Resolution and FPS are: 1920 x1080 and 60 fps respectively.

  2. Yes, I modified the deepstream_test3.py code. Since it’s completely different, I’ll share the code in a DM.

  3. When I ran that command (deepstream-app -c source30_1080p_dec_infer-resnet_tiled_display_int8.yml), I got this:

** ERROR: <main:733>: Could not open X Display
Quitting
nvstreammux: Successfully handled EOS for source_id=0
nvstreammux: Successfully handled EOS for source_id=1
nvstreammux: Successfully handled EOS for source_id=2
nvstreammux: Successfully handled EOS for source_id=3
nvstreammux: Successfully handled EOS for source_id=4
nvstreammux: Successfully handled EOS for source_id=5
nvstreammux: Successfully handled EOS for source_id=6
nvstreammux: Successfully handled EOS for source_id=7
nvstreammux: Successfully handled EOS for source_id=8
nvstreammux: Successfully handled EOS for source_id=9
nvstreammux: Successfully handled EOS for source_id=10
nvstreammux: Successfully handled EOS for source_id=11
nvstreammux: Successfully handled EOS for source_id=12
nvstreammux: Successfully handled EOS for source_id=13
nvstreammux: Successfully handled EOS for source_id=14
nvstreammux: Successfully handled EOS for source_id=15
nvstreammux: Successfully handled EOS for source_id=16
nvstreammux: Successfully handled EOS for source_id=17
nvstreammux: Successfully handled EOS for source_id=18
nvstreammux: Successfully handled EOS for source_id=19
nvstreammux: Successfully handled EOS for source_id=20
nvstreammux: Successfully handled EOS for source_id=21
nvstreammux: Successfully handled EOS for source_id=22
nvstreammux: Successfully handled EOS for source_id=23
nvstreammux: Successfully handled EOS for source_id=24
nvstreammux: Successfully handled EOS for source_id=25
nvstreammux: Successfully handled EOS for source_id=26
nvstreammux: Successfully handled EOS for source_id=27
nvstreammux: Successfully handled EOS for source_id=28
nvstreammux: Successfully handled EOS for source_id=29
App run failed
  1. What is the encode format? If H264, A10 can’t support 30 streams at 1080p, 60fps; please refer to this link. The decoding performance of A2 is similar to A10’s.
  2. Please set type to 1 in source30_1080p_dec_infer-resnet_tiled_display_int8.yml, which means using fakesink (see the sketch below), then share the result of "deepstream-app -c source30_1080p_dec_infer-resnet_tiled_display_int8.yml".
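For reference, a sketch of the relevant sink section of that YAML; check your local copy for the exact keys, as this is written from memory:

sink0:
  enable: 1
  # type 1 = fakesink (no rendering, no X display needed);
  # type 2 is the EGL/X11 sink that produces "Could not open X Display"
  type: 1
  sync: 0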

Yes, H264.
What do you mean by “A10 can’t support 30 streams with 1080p, 60fps”?

You can find the result attached here:
1.txt (19.5 MB)

In the link mentioned in my last comment, you can see that A10 can support 37 streams at 1080p, 30fps. A2 and A10 have similar decoding capability, so A2 may not be able to support 30 streams at 1080p, 60fps. To confirm this, you can set the source to your test file in source30_1080p_dec_infer-resnet_tiled_display_int8.yml, then check the test fps.
In the log 1.txt, the fps of each stream stays at 29~30; that is, A2 can support the decoding and inference of 30 streams at 1080p, 30fps.

I have no problem with 30 fps. But when I run my pipeline with 30 streams and two SGIEs in it, my fps is around 2.0 (with Kafka turned off).

When I remove the SGIEs, I get 19~20 fps. I never reached 30 fps.

  1. What do you mean by “I have no problem with 30 fps”? Do you mean that when testing 30 streams at 1080p, 60fps with source30_1080p_dec_infer-resnet_tiled_display_int8.yml, each stream can reach 30fps?
  2. As I mentioned above, A2 may not be able to support 30 streams at 1080p, 60fps. Can you try 10 or 20 streams first?
  3. About the 2fps issue, please refer to this topic for “Enable Latency measurement for deepstream sample apps” (a sketch of the Python side follows below); then you can see which plugin is consuming too much time.
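For reference, a sketch of how latency measurement is typically enabled from Python; the environment variables are the documented switches, and the probe call follows the pattern in newer deepstream_test3.py versions (treat the exact names as assumptions for your bindings version):

import os
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst
import pyds

# Both variables must be set before the pipeline starts.
os.environ["NVDS_ENABLE_LATENCY_MEASUREMENT"] = "1"
os.environ["NVDS_ENABLE_COMPONENT_LATENCY_MEASUREMENT"] = "1"

def latency_probe(pad, info, u_data):
    gst_buffer = info.get_buffer()
    if gst_buffer:
        # Prints per-plugin latency for every frame in this batch.
        if pyds.nvds_measure_buffer_latency(hash(gst_buffer)) == 0:
            print("Unable to measure latency for this buffer")
    return Gst.PadProbeReturn.OK

# Attach near the end of the pipeline, e.g. on the tiler's sink pad:
# tiler.get_static_pad("sink").add_probe(Gst.PadProbeType.BUFFER, latency_probe, 0)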

1- I mean that I would like to reach 30 fps in my pipeline.
2- With 6 streams at 1080p, 60fps, the pipeline runs at 10~12 fps.
3- Yes, I did that. The weird thing is that sometimes the bottleneck is “nvstreammux-Stream-muxer”, sometimes it is “primary-inference”, and other times it is “secondary-inference”, all in the same run :')

@fanzh
I want to share with you the results of my experiment with deepstream_test3.py.

GPU: A2
inference type: nvinferserver
Detector: Peoplenet of batch size 64
Number of streams: 18 (1080p, 60fps)
GPU and decoder utilization:

FPS:

Config of pgie:
config_triton_infer_primary_peoplenet.txt (1.2 KB)

  1. Currently I am unable to reproduce this low fps issue with deepstream_test3.py. Please refer to my test details:
    GPU: RTX 6000
    DeepStream version: nvcr.io/nvidia/deepstream:6.4-triton-multiarch
    Inference type: nvinferserver
    Detector: Peoplenet with batch size 30
    Number of streams: 20 (1080p, 60fps)
    Steps:
    1> Download the peoplenet model and generate a TRT engine for Triton with this command-line:
trtexec --onnx=./triton_model_repo/peoplenet/resnet34_peoplenet_int8.onnx --int8 --calib=./models/triton_model_repo/resnet34_peoplenet_int8.txt --saveEngine=./triton_model_repo/peoplenet/1/resnet34_peoplenet_int8.onnx_b1_gpu0_int8.engine --minShapes="input_1:0":1x3x544x960 --optShapes="input_1:0":30x3x544x960 --maxShapes="input_1:0":30x3x544x960

2> Set only max_batch_size: 30 in config.pbtxt, and only max_batch_size: 30 in config_triton_infer_primary_peoplenet.txt (see the snippet after this list).
3> Start the test. Here is the whole log: log0402.txt (16.0 KB). The fps is about 129. You can try the same start command-line.
  2. Did you modify deepstream_test3.py? If yes, please share the diff.
  3. If the fps is still low, please share the results for 3 streams (1080p, 60fps).
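For clarity, the batch-size change in step 2> is a one-line edit in each file; a sketch showing only the relevant fields (other fields in both files stay as shipped):

# triton_model_repo/peoplenet/config.pbtxt (Triton model configuration)
max_batch_size: 30

# config_triton_infer_primary_peoplenet.txt (gst-nvinferserver configuration)
infer_config {
  max_batch_size: 30
}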

Thank you @fanzh for your response.
I ran the experiment you described, and here are the results:

Model: PeopleNet, converted and quantized, batch size 30

trtexec --onnx=./resnet34_peoplenet_int8.onnx --int8 --calib=./resnet34_peoplenet_int8.txt --saveEngine=./1/quantized_peoplenet.engine --minShapes="input_1:0":1x3x544x960 --optShapes="input_1:0":30x3x544x960 --maxShapes="input_1:0":30x3x544x960

config files:

config.txt.txt (1.4 KB)
config_triton_infer_primary_peoplenet.txt (1.2 KB)

FIRST EXPERIMENT

Pipeline: deepstream_test_3 (with no changes)
Number of streams: 20
Dataset: Wildtrack
Resolution and fps: 1080p, 60fps
Infer type: nvinferserver
Log file:
20.log (17.0 KB)

SECOND EXPERIMENT

Pipeline: deepstream_test_3 (with no changes)
Number of streams: 3
Dataset: Wildtrack
Resolution and fps: 1080p, 60fps
Infer type: nvinferserver
Log file:
3.log (4.1 KB)

I don’t really understand why I get noticeably lower fps per stream with 20 streams than with 3 streams.
We are building a solution that should handle 30~60 streams on an A2. Is that feasible?

Thank you!

The total fps is equal to the number of streams times the fps of each stream.
In the latest test, the fps of each of the 20 streams is 26, which is better than the test on Apr 1 (7fps).

Yes, please refer to my last test. If using 60 streams, please change batch 30 to 60. And please use the following method to check whether there is a decoding or inference bottleneck.

  1. Remove the nvinferserver plugin to test decoding only. When the decoder utilization approaches 100%, that fps is the maximum decoding fps.
  2. You can use “trtexec --loadEngine=saved.engine --fp16” to get the theoretical maximum inference fps (see the sketch below). The overall fps will be less than both the max decoding fps and the max inference fps.
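To make the arithmetic concrete, a small sketch that turns the two measurements into a stream budget; the assumption is that trtexec’s “Throughput: N qps” line counts engine executions, i.e. one query = one batch:

# Hedged sketch: estimating how many 60fps streams the pipeline can sustain.
def supported_streams(trtexec_qps: float, batch_size: int,
                      max_decode_fps: float, stream_fps: float = 60.0) -> int:
    max_infer_fps = trtexec_qps * batch_size        # frames/sec the engine can infer
    total_fps = min(max_infer_fps, max_decode_fps)  # the slower stage wins
    return int(total_fps // stream_fps)

# Example with made-up numbers: 10 qps at batch 30 with 1000 fps of decode
# headroom gives min(300, 1000) // 60 = 5 streams at 60fps.
print(supported_streams(10.0, 30, 1000.0))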

Thank you for your replies @fanzh

Can you elaborate?

Yes, I think that’s because I was not using the calibration file while converting.

I unlinked the pgie (nvinferserver) from the deepstream_test_3 pipeline. The decoder utilization was 97~99%.

Screenshot 2024-04-05 030957

As soon as I linked the pgie back in, the utilization dropped to 16~30%.

Screenshot 2024-04-05 031037

I don’t really get this point. What do you mean by it?

In the test in “20.log”, the fps of each stream is 25, so the total fps is 25x20=500; in “3.log”, the fps of each stream is 126, so the total fps is 126x3=378.

From the test, the decoder utilization can’t reach 100% because inference can’t process frames fast enough; inference is the bottleneck. Can you share the two results of “trtexec --loadEngine=saved.engine --fp16”? One “saved.engine” generated with batch 30, the other with batch 60. From the results, you will get the theoretical inference performance. Please refer to this topic.

Do you mean running that command on the engine I already generated with the following command?

trtexec --onnx=./resnet34_peoplenet_int8.onnx --int8 --calib=./resnet34_peoplenet_int8.txt --saveEngine=./1/quantized_peoplenet.engine --minShapes="input_1:0":1x3x544x960 --optShapes="input_1:0":30x3x544x960 --maxShapes="input_1:0":30x3x544x960

Yes. Please also test batch 60 (a sketch of the command follows below); I wonder if the performance will improve as the batch size increases.
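For example, your earlier command with the opt/max shapes bumped to 60 (a sketch; the engine file name is changed here only so it does not overwrite the batch-30 engine):

trtexec --onnx=./resnet34_peoplenet_int8.onnx --int8 --calib=./resnet34_peoplenet_int8.txt --saveEngine=./1/quantized_peoplenet_b60.engine --minShapes="input_1:0":1x3x544x960 --optShapes="input_1:0":60x3x544x960 --maxShapes="input_1:0":60x3x544x960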

Sorry for the late response.

So, you want me to convert & test the model with batch size 60?