Severe FPS drop in DeepStream pipeline

Hello, with reference to the following ticket, I am using 2 extra secondary nvdspreprocess instances after the PGIE to provide embeddings to my model's non-image input layer at runtime.

This method worked for me, but the FPS dropped to half. With static embeddings and without these extra preprocess instances I was getting around 14-15 FPS, but after adding the two of them it drops to 7-8 FPS. Can you tell me why this is happening?

NOTE: I observed the same FPS drop even with the standard preprocessing and no additional custom logic.

nvdspreprocess usually does not affect FPS. Please try the test below; you should get similar results.

Without nvdspreprocess, run this command line:

GST_DEBUG=fpsdisplaysink:6 gst-launch-1.0 uridecodebin uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264 \
! mux.sink_0 nvstreammux name=mux batch-size=1 width=1280 height=720 ! \
nvinfer batch-size=1 config-file-path=/opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-preprocess-test/config_infer.txt ! nvvideoconvert \
! fpsdisplaysink sync=0 video-sink=fakesink

With nvdspreprocess:
1. Modify /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-preprocess-test/config_preprocess.txt as follows:

# [group-0]
# src-ids=0;1
# custom-input-transformation-function=CustomAsyncTransformation
# process-on-roi=1
# roi-params-src-0=300;200;700;800;1300;300;600;700
# roi-params-src-1=860;300;900;500;50;300;500;700

# [group-1]
# src-ids=2
# custom-input-transformation-function=CustomAsyncTransformation
# process-on-roi=1
# roi-params-src-2=50;300;500;700;650;300;500;500;1300;300;600;700

# [group-2]
# src-ids=3
# custom-input-transformation-function=CustomAsyncTransformation
# process-on-roi=0
# draw-roi=0
# roi-params-src-3=0;540;900;500;960;0;900;500

[group-0]
src-ids=0
custom-input-transformation-function=CustomAsyncTransformation
process-on-roi=0
2. Then run this command line:
GST_DEBUG=fpsdisplaysink:6 gst-launch-1.0 uridecodebin uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264 ! mux.sink_0 \
nvstreammux name=mux batch-size=1 width=1280 height=720 ! nvdspreprocess config-file=/opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-preprocess-test/config_preprocess.txt ! \
nvinfer input-tensor-meta=1 batch-size=1 config-file-path=/opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-preprocess-test/config_infer.txt \
! nvvideoconvert ! fpsdisplaysink sync=0 video-sink=fakesink

Please check whether the GPU/CPU usage increases after adding your preprocess.

You can use the Nsight analysis tools for tuning; refer to this FAQ.

Hello, as per your advice I tried the test, and as you can see in the following console output, the FPS drops after adding the preprocess.

GST_DEBUG=fpsdisplaysink:6 gst-launch-1.0 uridecodebin uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264 \
! mux.sink_0 nvstreammux name=mux batch-size=1 width=1280 height=720 ! \
nvinfer batch-size=1 config-file-path=/opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-preprocess-test/config_infer.txt ! nvvideoconvert \
! fpsdisplaysink sync=0 video-sink=fakesink
0:01:45.258600729 201547 0xaaaae9d9c8c0 DEBUG fpsdisplaysink fpsdisplaysink.c:373:display_current_fps: Updated max-fps to 731.508459
0:01:45.258679317 201547 0xaaaae9d9c8c0 DEBUG fpsdisplaysink fpsdisplaysink.c:377:display_current_fps: Updated min-fps to 731.508459
0:01:45.759887102 201547 0xaaaae9d9c8c0 DEBUG fpsdisplaysink fpsdisplaysink.c:373:display_current_fps: Updated max-fps to 859.789464
0:01:46.260722140 201547 0xaaaae9d9c8c0 DEBUG fpsdisplaysink fpsdisplaysink.c:373:display_current_fps: Updated max-fps to 906.488186

GST_DEBUG=fpsdisplaysink:6 gst-launch-1.0 uridecodebin uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264 ! mux.sink_0 \
nvstreammux name=mux batch-size=1 width=1280 height=720 ! nvdspreprocess config-file=/opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-preprocess-test/config_preprocess.txt ! \
nvinfer input-tensor-meta=1 batch-size=1 config-file-path=/opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-preprocess-test/config_infer.txt \
! nvvideoconvert ! fpsdisplaysink sync=0 video-sink=fakesink
0:01:46.219996797 201794 0xaaaaec92f060 DEBUG fpsdisplaysink fpsdisplaysink.c:373:display_current_fps: Updated max-fps to 637.475991
0:01:46.220042556 201794 0xaaaaec92f060 DEBUG fpsdisplaysink fpsdisplaysink.c:377:display_current_fps: Updated min-fps to 637.475991
0:01:46.720385690 201794 0xaaaaec92f060 DEBUG fpsdisplaysink fpsdisplaysink.c:373:display_current_fps: Updated max-fps to 865.326739
0:01:47.221709589 201794 0xaaaaec92f060 DEBUG fpsdisplaysink fpsdisplaysink.c:373:display_current_fps: Updated max-fps to 917.592245

Do you mean I should check that for the test you gave, or on my own pipeline?

The second command line is the one with nvdspreprocess added; the FPS improved after adding preprocessing.

https://drive.google.com/drive/folders/1elgO7nOYJBVjzo1ZrK3fjN-q0WTxZO2L?usp=sharing

This folder contains the report generated with nsys for my DeepStream app. I tried to analyse it with the GUI tool but could not find anything of use, so could you please tell me where the bottleneck is?

The first figure shows that you spend a lot of time in the custom tensor preparation, an average of 100 ms, and up to 4 s when batch_num=1027.

In the second figure, when batch_num=1027, a large amount of time, up to 4 s, is consumed in cudaMemcpy, which is usually caused by waiting on synchronization.

You need to check why cudaMemcpy blocks, and optimize the custom tensor conversion function; 100+ ms is too long.

Thank you very much for pointing this out, but can you also tell me how I can find which custom tensor is causing this? I am using 4 preprocess plugins to prepare custom tensors: one for my person detection model, one for the recognition model, and the other two for the swap model.

Can you provide any guidance on this synchronisation issue? The cudaMemcpy calls I use only copy the 1x512-dimensional FP32 embeddings and, in one of the preprocess instances, the modified 128x128 frame.

You can find many examples of nvtxDomainRangePushEx in the code. Add trace points in your code and use Nsight to find the related functions.

  nvtxDomainRangePushEx(ctx->nvtx_domain, &eventAttrib);
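
For context, here is a minimal sketch of what such a trace point could look like when wrapped around one of the custom tensor preparation functions. The domain name, the push_range() helper and prepare_embedding_tensor() are hypothetical names for illustration only; your actual structs and entry points will differ.

  // Minimal NVTX instrumentation sketch: each of the four nvdspreprocess
  // custom libs gets its own named range, so they show up as separate rows
  // in the Nsight Systems timeline and the slow one is easy to spot.
  #include "nvtx3/nvToolsExt.h"   // older CUDA toolkits: <nvToolsExt.h> and link -lnvToolsExt

  static nvtxDomainHandle_t my_domain = nvtxDomainCreateA("swap-pipeline");

  static void push_range(const char *name)
  {
      nvtxEventAttributes_t eventAttrib = {};
      eventAttrib.version       = NVTX_VERSION;
      eventAttrib.size          = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
      eventAttrib.messageType   = NVTX_MESSAGE_TYPE_ASCII;
      eventAttrib.message.ascii = name;          // e.g. "recognition-embedding-prep"
      nvtxDomainRangePushEx(my_domain, &eventAttrib);
  }

  void prepare_embedding_tensor(/* ... */)       // hypothetical custom-lib entry point
  {
      push_range("recognition-embedding-prep");
      // ... existing tensor preparation / cudaMemcpy work ...
      nvtxDomainRangePop(my_domain);             // close the range when done
  }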

Please check the CUDA programming manual. This may be because your buffer is not ready.
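
For illustration only, here is a minimal sketch (assuming the embedding really is a small 1x512 FP32 buffer) of staging it in pinned host memory and copying it with cudaMemcpyAsync on a dedicated stream, then synchronizing only that stream. EMB_LEN and copy_embedding() are made-up names for the example, not your code or a DeepStream API.

  // Sketch: non-blocking host-to-device copy of a small embedding tensor.
  #include <cstring>
  #include <cuda_runtime.h>

  constexpr size_t EMB_LEN = 512;   // 1x512 FP32 embedding (assumed size)

  void copy_embedding(float *d_embedding, const float *src, cudaStream_t stream)
  {
      // Page-locked (pinned) staging buffer: cudaMemcpyAsync from pageable
      // host memory silently falls back to a synchronous copy.
      static float *h_pinned = nullptr;
      if (!h_pinned)
          cudaMallocHost(&h_pinned, EMB_LEN * sizeof(float));

      std::memcpy(h_pinned, src, EMB_LEN * sizeof(float));

      // Asynchronous copy on the preprocess stream, not the default stream;
      // a synchronous cudaMemcpy on a busy default stream must wait for all
      // previously queued GPU work, which is one common cause of the long
      // cudaMemcpy waits seen in a trace.
      cudaMemcpyAsync(d_embedding, h_pinned, EMB_LEN * sizeof(float),
                      cudaMemcpyHostToDevice, stream);

      // Synchronize only this stream, and only right before nvinfer
      // actually needs the tensor.
      cudaStreamSynchronize(stream);
  }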

Thanks for your help, but I have a few more queries. As per your suggestion I am using 2 sources for input, which causes the FPS to drop from 30 to 15. If I use the non-image-layer preprocess on a single source only, the FPS is 22, which is not much of a drop, but then I face another problem: all faces are swapped except every 12th image. I also put a counter in the non-image-layer preprocess and in the custom parser of the face swapper model, and I observed the following:

  1. If the non-image preprocess is enabled, the parser count is 924 and the preprocess count is 462.
  2. If the non-image preprocess is not enabled, the parser count is 462.

This is normal. Processing multiple objects will consume more GPU/CPU time, so please optimize your processing function first.

Sorry, I don't understand what this is; please refer to the reply above and optimize the model and pre-processing.

Yes, we know this. The point we want to drive home is that with a single source the FPS drop is not significant, so we want to go with a single source. But with a single source we found that every 12th image generated by the swap model is black/white, and we want to know how to resolve this. We want to go with a single source only, as it seems it can work and give better performance than using two sources, one for each pre-process.

We put a static counter in the custom pre-process and the custom parser to check how many times each is called. We found that with two pre-processes, one for the image layer and the other for the non-image layer, the custom parser is called twice. Why?

We were able to resolve this by:

  1. fine-tuning the RTSP buffers in the configuration file
  2. removing the pre-process and adopting the second approach mentioned in How to pass custom input to non image layer of model during runtime

Glad to hear that. So currently you are using nvinferserver?

No, currently we are using nvinfer only. Will nvinferserver (I guess Triton server) help in improving the performance? What other options are there to improve performance, such as:

  1. Increasing the batch size?
  2. Implementing an architecture like deepstream_reference_apps/deepstream_parallel_inference_app at master · NVIDIA-AI-IOT/deepstream_reference_apps · GitHub?
  3. Currently the Yolo and Human Attribute models are not in the purview of the tracker, since Yolo works as an SGIE with a custom full-frame preprocess and the Human Attribute model operates on the Yolo output. Any proposal to improve here?
  4. Any other method already explained by NVIDIA.

Thanks

If you have multiple cameras, increasing the batch size will usually improve performance. If you only have one camera, it will not improve performance.

Parallel inference does not improve performance either. This is just to show how to run multiple models simultaneously in a pipeline.

Use CUDA for preprocessing, optimize the model (use INT8 inference or other optimizations), and use the above-mentioned Nsight tools to analyze performance bottlenecks.
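
As a rough illustration of the first point, "use CUDA for preprocessing" could look like the sketch below: a small kernel that converts packed 8-bit RGB into the planar FP32 layout a TensorRT engine typically expects, launched on the same stream as the copies. The kernel name, layout and scale factor are assumptions for the example, not part of the DeepStream API.

  // Sketch: GPU-side pixel preprocessing instead of per-pixel CPU loops.
  #include <cstdint>
  #include <cuda_runtime.h>

  __global__ void rgb_hwc_to_chw_float(const uint8_t *src, float *dst,
                                        int width, int height, float scale)
  {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      if (x >= width || y >= height) return;

      int in_idx  = (y * width + x) * 3;   // interleaved HWC input
      int plane   = width * height;
      int out_idx = y * width + x;         // planar CHW output

      dst[0 * plane + out_idx] = src[in_idx + 0] * scale;
      dst[1 * plane + out_idx] = src[in_idx + 1] * scale;
      dst[2 * plane + out_idx] = src[in_idx + 2] * scale;
  }

  // Launch on the same CUDA stream as the copies so the work stays ordered.
  void launch_preprocess(const uint8_t *d_src, float *d_dst,
                         int width, int height, cudaStream_t stream)
  {
      dim3 block(16, 16);
      dim3 grid((width + block.x - 1) / block.x,
                (height + block.y - 1) / block.y);
      rgb_hwc_to_chw_float<<<grid, block, 0, stream>>>(d_src, d_dst,
                                                       width, height, 1.f / 255.f);
  }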

Thanks for your response. Can you also comment on this:

Will nvinferserver (I guess Triton server) help in improving the performance?

I will check on optimizing the model.

Thanks

Usually it has no impact on performance. nvinferserver is designed to support multimodal models or models with multiple inputs (nvinfer needs some tricks).

Thanks for your quick reply and confirmation.

Could you show me an example of using nvinfer with multiple inputs, or point me to resources that would help me learn how to do this? @junshengy