Severe FPS drop in DeepStream pipeline

Hello, with reference to the following ticket, I am using 2 extra secondary nvdspreprocess instances after the PGIE to provide embeddings to my model's non-image input layer at runtime.

This method worked for me, but the FPS dropped to half. With static embeddings and without these extra preprocess instances I was getting around 14-15 FPS, but after adding the two of them it drops to 7-8 FPS. Can you tell me why this is happening?

NOTE: I observed the same FPS drop even with the standard preprocessing and no additional custom logic.

nvdspreprocess usually does not affect FPS. Please try the test below; you should get similar results.

Without nvdspreprocess, run this command line:

GST_DEBUG=fpsdisplaysink:6 gst-launch-1.0 uridecodebin uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264 \
! mux.sink_0 nvstreammux name=mux batch-size=1 width=1280 height=720 ! \
nvinfer batch-size=1 config-file-path=/opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-preprocess-test/config_infer.txt ! nvvideoconvert \
! fpsdisplaysink sync=0 video-sink=fakesink

With nvdspreprocess:
1. Modify /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-preprocess-test/config_preprocess.txt as follows:

# [group-0]
# src-ids=0;1
# custom-input-transformation-function=CustomAsyncTransformation
# process-on-roi=1
# roi-params-src-0=300;200;700;800;1300;300;600;700
# roi-params-src-1=860;300;900;500;50;300;500;700

# [group-1]
# src-ids=2
# custom-input-transformation-function=CustomAsyncTransformation
# process-on-roi=1
# roi-params-src-2=50;300;500;700;650;300;500;500;1300;300;600;700

# [group-2]
# src-ids=3
# custom-input-transformation-function=CustomAsyncTransformation
# process-on-roi=0
# draw-roi=0
# roi-params-src-3=0;540;900;500;960;0;900;500

[group-0]
src-ids=0
custom-input-transformation-function=CustomAsyncTransformation
process-on-roi=0
2. Then run this command line:
GST_DEBUG=fpsdisplaysink:6 gst-launch-1.0 uridecodebin uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264 ! mux.sink_0 \
nvstreammux name=mux batch-size=1 width=1280 height=720 ! nvdspreprocess config-file=/opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-preprocess-test/config_preprocess.txt ! \
nvinfer input-tensor-meta=1 batch-size=1 config-file-path=/opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-preprocess-test/config_infer.txt \
! nvvideoconvert ! fpsdisplaysink sync=0 video-sink=fakesink

Please check whether the GPU/CPU usage increases after adding your preprocess.

You can use the Nsight analysis tools for tuning; refer to this FAQ.

Hello, as per your advice I tried the test, and as you can see in the following console output, the FPS drops after adding the preprocess.

GST_DEBUG=fpsdisplaysink:6 gst-launch-1.0 uridecodebin uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264 \
! mux.sink_0 nvstreammux name=mux batch-size=1 width=1280 height=720 ! \
nvinfer batch-size=1 config-file-path=/opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-preprocess-test/config_infer.txt ! nvvideoconvert \
! fpsdisplaysink sync=0 video-sink=fakesink
0:01:45.258600729 201547 0xaaaae9d9c8c0 DEBUG fpsdisplaysink fpsdisplaysink.c:373:display_current_fps: Updated max-fps to 731.508459
0:01:45.258679317 201547 0xaaaae9d9c8c0 DEBUG fpsdisplaysink fpsdisplaysink.c:377:display_current_fps: Updated min-fps to 731.508459
0:01:45.759887102 201547 0xaaaae9d9c8c0 DEBUG fpsdisplaysink fpsdisplaysink.c:373:display_current_fps: Updated max-fps to 859.789464
0:01:46.260722140 201547 0xaaaae9d9c8c0 DEBUG fpsdisplaysink fpsdisplaysink.c:373:display_current_fps: Updated max-fps to 906.488186

GST_DEBUG=fpsdisplaysink:6 gst-launch-1.0 uridecodebin uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264 ! mux.sink_0 \
nvstreammux name=mux batch-size=1 width=1280 height=720 ! nvdspreprocess config-file=/opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-preprocess-test/config_preprocess.txt ! \
nvinfer input-tensor-meta=1 batch-size=1 config-file-path=/opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-preprocess-test/config_infer.txt \
! nvvideoconvert ! fpsdisplaysink sync=0 video-sink=fakesink
0:01:46.219996797 201794 0xaaaaec92f060 DEBUG fpsdisplaysink fpsdisplaysink.c:373:display_current_fps: Updated max-fps to 637.475991
0:01:46.220042556 201794 0xaaaaec92f060 DEBUG fpsdisplaysink fpsdisplaysink.c:377:display_current_fps: Updated min-fps to 637.475991
0:01:46.720385690 201794 0xaaaaec92f060 DEBUG fpsdisplaysink fpsdisplaysink.c:373:display_current_fps: Updated max-fps to 865.326739
0:01:47.221709589 201794 0xaaaaec92f060 DEBUG fpsdisplaysink fpsdisplaysink.c:373:display_current_fps: Updated max-fps to 917.592245

Do you mean I should check that for the test you gave, or on my own pipeline?

The second command line is the one with nvdspreprocess added; the FPS improved after adding preprocessing.

https://drive.google.com/drive/folders/1elgO7nOYJBVjzo1ZrK3fjN-q0WTxZO2L?usp=sharing

This folder contains the report generated with nsys for my DeepStream app. I tried to analyse it with the GUI tool but could not find anything of use, so could you please tell me where the bottleneck is?

The first figure shows that you spend a lot of time in the custom tensor preparation, an average of 100 ms, and up to 4 s when batch_num=1027.

In the second figure, when batch_num=1027, a large amount of time, up to 4 s, is consumed in cudaMemcpy, which is usually caused by waiting on synchronization.

You need to check why cudaMemcpy blocks, and optimize the custom tensor conversion function; 100+ ms is too long.

Thank you very much for pointing this out, but can you also tell me how I can find which custom tensor is causing this? I am using 4 preprocess plugins to prepare custom tensors: one for my person detection model, one for the recognition model, and the other two for the swap model.

Can you provide any guidance on this synchronisation issue? The cudaMemcpy calls I use only copy the 1x512-dimensional FP32 embeddings and, in one of the preprocess instances, the modified 128x128 frame.

You can find many examples of nvtxDomainRangePushEx in the code. Add trace points in your code and use Nsight to find the related functions.

  nvtxDomainRangePushEx(ctx->nvtx_domain, &eventAttrib);
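
For context, here is a minimal sketch of what such a trace point could look like when wrapped around one of the custom tensor preparation functions. The domain name, the push_range() helper and prepare_embedding_tensor() are hypothetical names for illustration only; your actual structs and entry points will differ.

  // Minimal NVTX instrumentation sketch: each of the four nvdspreprocess
  // custom libs gets its own named range, so they show up as separate rows
  // in the Nsight Systems timeline and the slow one is easy to spot.
  #include "nvtx3/nvToolsExt.h"   // older CUDA toolkits: <nvToolsExt.h> and link -lnvToolsExt

  static nvtxDomainHandle_t my_domain = nvtxDomainCreateA("swap-pipeline");

  static void push_range(const char *name)
  {
      nvtxEventAttributes_t eventAttrib = {};
      eventAttrib.version       = NVTX_VERSION;
      eventAttrib.size          = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
      eventAttrib.messageType   = NVTX_MESSAGE_TYPE_ASCII;
      eventAttrib.message.ascii = name;          // e.g. "recognition-embedding-prep"
      nvtxDomainRangePushEx(my_domain, &eventAttrib);
  }

  void prepare_embedding_tensor(/* ... */)       // hypothetical custom-lib entry point
  {
      push_range("recognition-embedding-prep");
      // ... existing tensor preparation / cudaMemcpy work ...
      nvtxDomainRangePop(my_domain);             // close the range when done
  }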

Please check the CUDA programming manual. This may be because your buffer is not ready.
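
For illustration only, here is a minimal sketch (assuming the embedding really is a small 1x512 FP32 buffer) of staging it in pinned host memory and copying it with cudaMemcpyAsync on a dedicated stream, then synchronizing only that stream. EMB_LEN and copy_embedding() are made-up names for the example, not your code or a DeepStream API.

  // Sketch: non-blocking host-to-device copy of a small embedding tensor.
  #include <cstring>
  #include <cuda_runtime.h>

  constexpr size_t EMB_LEN = 512;   // 1x512 FP32 embedding (assumed size)

  void copy_embedding(float *d_embedding, const float *src, cudaStream_t stream)
  {
      // Page-locked (pinned) staging buffer: cudaMemcpyAsync from pageable
      // host memory silently falls back to a synchronous copy.
      static float *h_pinned = nullptr;
      if (!h_pinned)
          cudaMallocHost(&h_pinned, EMB_LEN * sizeof(float));

      std::memcpy(h_pinned, src, EMB_LEN * sizeof(float));

      // Asynchronous copy on the preprocess stream, not the default stream;
      // a synchronous cudaMemcpy on a busy default stream must wait for all
      // previously queued GPU work, which is one common cause of the long
      // cudaMemcpy waits seen in a trace.
      cudaMemcpyAsync(d_embedding, h_pinned, EMB_LEN * sizeof(float),
                      cudaMemcpyHostToDevice, stream);

      // Synchronize only this stream, and only right before nvinfer
      // actually needs the tensor.
      cudaStreamSynchronize(stream);
  }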

Thanks for your help, but I have a few more queries. As per your suggestion I am using 2 sources for input, which causes the FPS to drop from 30 to 15. If I use the non-image-layer preprocess on a single source only, the FPS is 22, which is not much of a drop, but then I face another problem: all faces are swapped except every 12th image. I also put a counter in the non-image-layer preprocess and in the custom parser of the face swapper model, and I observed the following:

  1. If the non-image preprocess is enabled, the parser count is 924 and the preprocess count is 462.
  2. If the non-image preprocess is not enabled, the parser count is 462.

This is normal. Processing multiple objects will consume more GPU/CPU time, so please optimize your processing function first.

Sorry, I don't understand what this is; please refer to the reply above and optimize the model and pre-processing.

Yes, we know this. The point we want to drive home is that with a single source the FPS drop is not significant, so we want to go with a single source. But with a single source we found that every 12th image generated by the swap model is black/white, and we want to know how to resolve this. We want to go with a single source only, as it seems it can work and give better performance than using two sources, one for each pre-process.

We put a static counter in the custom pre-process and the custom parser to check how many times each is called. We found that with two pre-processes, one for the image layer and the other for the non-image layer, the custom parser is called twice. Why?

We were able to resolve this by:

  1. fine-tuning the RTSP buffers in the configuration file
  2. removing the pre-process and adopting the second approach mentioned in How to pass custom input to non image layer of model during runtime

Glad to hear that. So currently you are using nvinferserver?

No, currently we are using nvinfer only. Will nvinferserver (I guess Triton server) help in improving the performance? What other options are there to improve performance, such as:

  1. Increasing the batch size?
  2. Implementing an architecture like deepstream_reference_apps/deepstream_parallel_inference_app at master · NVIDIA-AI-IOT/deepstream_reference_apps · GitHub?
  3. Currently the Yolo and Human Attribute models are not in the purview of the tracker, since Yolo works as an SGIE with a custom full-frame preprocess and the Human Attribute model operates on the Yolo output. Any proposal to improve here?
  4. Any other method already explained by NVIDIA.

Thanks

If you have multiple cameras, increasing the batch size will usually improve performance. If you only have one camera, it will not improve performance.

Parallel inference does not improve performance either. This is just to show how to run multiple models simultaneously in a pipeline.

Use CUDA for preprocessing, optimize the model (use INT8 inference or other optimizations), and use the above-mentioned Nsight tools to analyze performance bottlenecks.
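
As a rough illustration of the first point, "use CUDA for preprocessing" could look like the sketch below: a small kernel that converts packed 8-bit RGB into the planar FP32 layout a TensorRT engine typically expects, launched on the same stream as the copies. The kernel name, layout and scale factor are assumptions for the example, not part of the DeepStream API.

  // Sketch: GPU-side pixel preprocessing instead of per-pixel CPU loops.
  #include <cstdint>
  #include <cuda_runtime.h>

  __global__ void rgb_hwc_to_chw_float(const uint8_t *src, float *dst,
                                        int width, int height, float scale)
  {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      if (x >= width || y >= height) return;

      int in_idx  = (y * width + x) * 3;   // interleaved HWC input
      int plane   = width * height;
      int out_idx = y * width + x;         // planar CHW output

      dst[0 * plane + out_idx] = src[in_idx + 0] * scale;
      dst[1 * plane + out_idx] = src[in_idx + 1] * scale;
      dst[2 * plane + out_idx] = src[in_idx + 2] * scale;
  }

  // Launch on the same CUDA stream as the copies so the work stays ordered.
  void launch_preprocess(const uint8_t *d_src, float *d_dst,
                         int width, int height, cudaStream_t stream)
  {
      dim3 block(16, 16);
      dim3 grid((width + block.x - 1) / block.x,
                (height + block.y - 1) / block.y);
      rgb_hwc_to_chw_float<<<grid, block, 0, stream>>>(d_src, d_dst,
                                                       width, height, 1.f / 255.f);
  }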

Thanks for your response. Can you also comment on this:

Will nvinferserver (I guess Triton server) help in improving the performance?

I will check on optimizing the model.

Thanks

Usually it has no impact on performance. nvinferserver is designed to support multimodal models or models with multiple inputs (nvinfer needs some tricks).

Thanks for your quick reply and confirmation.

Could you show me an example of using nvinfer with multiple inputs, or point me to resources that would help me learn how to do this? @junshengy