Deepstream-server FPS Keep Falling After A While

So we need to know the model engine’s performance first. Have you measured it with “trtexec”?

Can I use these results from when I first converted it from ONNX to TRT?

[11/20/2025-02:09:52] [I] === Performance summary === 

[11/20/2025-02:09:52] [I] Throughput: 67.0395 qps 

[11/20/2025-02:09:52] [I] Latency: min = 24.5115 ms, max = 25.1259 ms, mean = 24.8803 ms, median = 24.8773 ms, percentile(90%) = 24.9915 ms, percentile(95%) = 25.0333 ms, percentile(99%) = 25.0992 ms 

[11/20/2025-02:09:52] [I] Enqueue Time: min = 0.864746 ms, max = 1.53534 ms, mean = 0.908439 ms, median = 0.892609 ms, percentile(90%) = 0.942047 ms, percentile(95%) = 0.963501 ms, percentile(99%) = 1.15454 ms 

[11/20/2025-02:09:52] [I] H2D Latency: min = 6.30591 ms, max = 6.34033 ms, mean = 6.31577 ms, median = 6.31271 ms, percentile(90%) = 6.32501 ms, percentile(95%) = 6.33289 ms, percentile(99%) = 6.33716 ms 

[11/20/2025-02:09:52] [I] GPU Compute Time: min = 14.6442 ms, max = 15.0804 ms, mean = 14.843 ms, median = 14.8311 ms, percentile(90%) = 14.9565 ms, percentile(95%) = 14.9995 ms, percentile(99%) = 15.0723 ms 

[11/20/2025-02:09:52] [I] D2H Latency: min = 3.43481 ms, max = 3.77759 ms, mean = 3.72158 ms, median = 3.72382 ms, percentile(90%) = 3.74524 ms, percentile(95%) = 3.75122 ms, percentile(99%) = 3.7627 ms 

[11/20/2025-02:09:52] [I] Total Host Walltime: 3.04298 s 

[11/20/2025-02:09:52] [I] Total GPU Compute Time: 3.02797 s

Yes, you can. Is the batch size 128?

Yep, my command when converting is:

trtexec --onnx=yolo11s.onnx --saveEngine=model.plan --minShapes=images:1x3x640x640 --optShapes=images:32x3x640x640 --maxShapes=images:128x3x640x640 --fp16

So the model engine seems quite fast. With ~25 ms latency per batch of 128, it can handle around 1000 / 25 × 128 = 5120 frames per second.

  1. According to Video Codec SDK | NVIDIA Developer, the L40S (similar to the L20) can decode at most 80 1080p@30fps H.264 video streams, which provides 30 × 80 = 2400 frames per second. If you have to use the hardware decoder in your case, it will be the bottleneck.
  2. Even if the sources can provide enough frames for inferencing, you also need to check the performance of your customized postprocessing function NvDsInferParseYolo() to make sure it can run more than 5120 times per second (see the timing sketch after this list).
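For a quick sanity check, a micro-benchmark along these lines can measure the parser in isolation (a minimal sketch; run_parser_once() is a hypothetical stand-in for a call to your actual NvDsInferParseYolo() on a captured set of output tensors):

// Hypothetical harness for timing a custom bounding-box parser.
// Replace the body of run_parser_once() with a real call to
// NvDsInferParseYolo() on recorded output layer buffers.
#include <chrono>
#include <cstdio>

static void run_parser_once() {
    // ... invoke NvDsInferParseYolo(outputLayers, networkInfo,
    //     detectionParams, objectList) here ...
}

int main() {
    const int iterations = 10000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        run_parser_once();
    auto t1 = std::chrono::steady_clock::now();
    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    std::printf("avg %.2f us/call -> %.0f calls/s\n",
                us / iterations, iterations * 1e6 / us);
    return 0;
}

If the resulting calls/s figure is below roughly 5120, the parser, not the engine, caps the pipeline.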

Good morning, Fiona.

I tried your advice on the custom parser problem and switched to the Ultralytics-recognized custom YOLO parser instead (GitHub - marcoslucianops/DeepStream-Yolo: NVIDIA DeepStream SDK 8.0 / 7.1 / 7.0 / 6.4 / 6.3 / 6.2 / 6.1.1 / 6.1 / 6.0.1 / 6.0 / 5.1 implementation for YOLO models). The results do look better for 20 fps streams: I can manage up to 45 streams before the pipeline dies, with better SM utilization. Since the custom parser is most likely the culprit, is there perhaps a way to squeeze more performance out of it? Perhaps by changing the value of some config parameters?

I am using 20 fps streams, so:

batch-push-timeout is 50000 (1,000,000 µs / 20 fps).
The target maximum number of streams is 64.

Streammux:

max-same-source-frames=2        
max-num-frames-per-batch=64
num-surfaces-per-frame=1
buffer-pool-size=4

Should I change these in nvinferserver_yolo11s:
extra {
  copy_input_to_host_buffers: false
  output_buffer_pool_size: 32
}

My full configs (just in case):

nvinferserver_yolo11s.txt (1.7 KB)

custom_nvstreammux_config.txt (661 Bytes)

dsserver_config.txt (1.7 KB)

46 streams:
GPU memory usage: ./deepstream-server-app               21816MiB

**PERF : FPS_0 (0.00)   FPS_1 (18.55)   FPS_2 (18.54)   FPS_3 (18.54)   
FPS_4 (18.54)   FPS_5 (18.54)   FPS_6 (18.54)   FPS_7 (18.54)   
FPS_8 (18.52)   FPS_9 (18.52)       FPS_10 (18.52)  FPS_11 (18.52)  
FPS_12 (18.52)  FPS_13 (18.51)  FPS_14 (18.51)  FPS_15 (18.49)  
FPS_16 (18.49)  FPS_17 (18.49)  FPS_18 (18.49)  FPS_19 (18.48)      
FPS_20 (18.04)  FPS_21 (18.04)  FPS_22 (18.04)  FPS_23 (18.04)  
FPS_24 (18.04)  FPS_25 (17.99)  FPS_26 (17.99)  FPS_27 (17.98)  
FPS_28 (17.98)  FPS_29 (17.98)      FPS_30 (17.53)  FPS_31 (17.53)  
FPS_32 (17.53)  FPS_33 (17.53)  FPS_34 (17.45)  FPS_35 (17.45) 
FPS_36 (17.44)  FPS_37 (17.45)  FPS_38 (17.44)  FPS_39 (17.45)      
FPS_40 (16.19)  FPS_41 (16.19)  FPS_42 (16.19)  FPS_43 (15.03)  
FPS_44 (14.34)  FPS_45 (8.79)

gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk

Idx      W      C      C      %      %      %      %      %      %    MHz    MHz

0    301     57      -     62     42      1     30      0      0   9001   2520 
0    292     53      -     65     48      1     29      0      0   9001   2520 
0    214     53      -     62     45      1     31      0      0   9001   2520 
0    156     54      -     63     47      1     30      0      0   9001   2520 
0    193     57      -     60     43      1     35      0      0   9001   2520 
0    254     57      -     61     44      1     32      0      0   9001   2520 
0    308     55      -     63     47      1     30      0      0   9001   2520 
0    269     53      -     65     46      1     29      0      0   9001   2520 
0    169     54      -     64     49      1     29      0      0   9001   2520 
0    175     54      -     61     45      1     31      0      0   9001   2520 
0    216     57      -     60     43      1     33      0      0   9001   2520 
0    292     57      -     63     43      1     30      0      0   9001   2520 
0    307     54      -     63     45      1     30      0      0   9001   2520 
0    237     53      -     64     45      1     30      0      0   9001   2520 
0     94     48      -      0      0      0      0      0      0   9001   2520 
0     86     47      -      0      0      0      0      0      0   9001   2520 
0     85     47      -      0      0      0      0      0      0   9001   2520 
0     84     46      -      0      0      0      0      0      0   9001   2520 


For the algorithm performance, you can consult the author of the algorithm about how to improve its efficiency. It is not provided by NVIDIA.

As a general suggestion, we recommend implementing the postprocessing tensor parsing algorithm in CUDA, just as we have done in deepstream_tools/yolo_deepstream/deepstream_yolo/config_infer_primary_yoloV11.txt at main · NVIDIA-AI-IOT/deepstream_tools.
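For illustration, here is a minimal sketch of what moving the first stage (confidence filtering) onto the GPU could look like. The tensor layout (one row per candidate box: 4 box coordinates followed by class scores) is an assumption about a YOLO-style output head, not the exact layout used by the linked sample:

// Hypothetical CUDA sketch: parallel confidence filtering of YOLO-style
// candidate boxes. Assumes pred is a device buffer of shape
// [num_boxes x num_attrs], attrs = {x, y, w, h, class scores...}.
#include <cuda_runtime.h>

__global__ void filter_boxes(const float *pred, int num_boxes, int num_attrs,
                             float conf_thresh, int *keep_flags)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_boxes)
        return;
    const float *row = pred + (size_t)i * num_attrs;
    float best = 0.f;                      // best class score for this box
    for (int c = 4; c < num_attrs; ++c)
        best = fmaxf(best, row[c]);
    keep_flags[i] = (best > conf_thresh) ? 1 : 0;
}

// Host-side launch: one thread per candidate box.
void filter_boxes_gpu(const float *d_pred, int num_boxes, int num_attrs,
                      float conf_thresh, int *d_keep, cudaStream_t stream)
{
    const int block = 256;
    const int grid = (num_boxes + block - 1) / block;
    filter_boxes<<<grid, block, 0, stream>>>(d_pred, num_boxes, num_attrs,
                                             conf_thresh, d_keep);
}

Only the surviving boxes then need to be copied back and run through NMS on the host, which cuts the per-frame CPU work substantially.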


By the way, @Fiona.Chen

Is it possible to dynamically add another model while the pipeline is running? (Similar to dynamic sources, but adding/removing models instead.) Also, the deepstream-server-app does not support SGIE, does it? How do I add this option?

Can you provide more details of the scenario?

All the DeepStream samples demonstrate the usage of the DeepStream components and APIs. deepstream-server-app focuses on demonstrating the usage of the “nvmultiurisrcbin” APIs. It does not conflict with the usage of “PGIE+SGIE”. The deepstream-test2 sample demonstrates how to construct “PGIE+SGIE” pipelines. You can construct an nvmultiurisrcbin+PGIE+SGIE pipeline according to the samples, as sketched below.
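A rough sketch of that combination (not the full deepstream-test2 sample; the config file paths are placeholders, and error handling plus the OSD/sink branch are omitted):

/* Minimal sketch of an nvmultiurisrcbin + PGIE + SGIE pipeline, following
 * the deepstream-test2 linking pattern. */
#include <gst/gst.h>

int main(int argc, char *argv[])
{
    gst_init(&argc, &argv);

    GstElement *pipeline = gst_pipeline_new("server-pipeline");
    GstElement *src  = gst_element_factory_make("nvmultiurisrcbin", "src");
    GstElement *pgie = gst_element_factory_make("nvinferserver", "pgie");
    GstElement *sgie = gst_element_factory_make("nvinferserver", "sgie");
    GstElement *sink = gst_element_factory_make("fakesink", "sink");
    if (!pipeline || !src || !pgie || !sgie || !sink)
        return -1;

    /* nvmultiurisrcbin embeds nvstreammux, so batching is configured here. */
    g_object_set(src, "max-batch-size", 64, NULL);
    g_object_set(pgie, "config-file-path", "pgie_config.txt", NULL);
    g_object_set(sgie, "config-file-path", "sgie_config.txt", NULL);

    gst_bin_add_many(GST_BIN(pipeline), src, pgie, sgie, sink, NULL);
    if (!gst_element_link_many(src, pgie, sgie, sink, NULL))
        return -1;

    gst_element_set_state(pipeline, GST_STATE_PLAYING);
    GstBus *bus = gst_element_get_bus(pipeline);
    GstMessage *msg = gst_bus_timed_pop_filtered(bus, GST_CLOCK_TIME_NONE,
        (GstMessageType)(GST_MESSAGE_ERROR | GST_MESSAGE_EOS));
    if (msg)
        gst_message_unref(msg);
    gst_element_set_state(pipeline, GST_STATE_NULL);
    gst_object_unref(bus);
    gst_object_unref(pipeline);
    return 0;
}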

What option?

And I tested it, thanks @Fiona.Chen!

The results look a bit better, as I can add one more stream before collapse:

./deepstream-server-app               21792MiB

**PERF : FPS_0 (0.00)   FPS_1 (19.81)   FPS_2 (19.81)   FPS_3 (19.81)   
FPS_4 (19.81)   FPS_5 (19.81)   FPS_6 (19.81)   FPS_7 (19.81)   
FPS_8 (19.81)   FPS_9 (19.81)   FPS_10 (19.81)  FPS_11 (19.81)  
FPS_12 (19.81)  FPS_13 (19.81)  FPS_14 (19.81)  FPS_15 (19.81)  
FPS_16 (19.81)  FPS_17 (19.81)  FPS_18 (19.81)       FPS_19 (19.80)  
FPS_20 (19.80)  FPS_21 (19.80)  FPS_22 (19.80)  FPS_23 (19.80)  
FPS_24 (19.81)  FPS_25 (19.81)  FPS_26 (19.81)  FPS_27 (19.81)      
FPS_28 (19.81)  FPS_29 (19.81)  FPS_30 (19.81)  FPS_31 (19.81)  
FPS_32 (19.81)  FPS_33 (19.81)  FPS_34 (19.80)  FPS_35 (19.80)  
FPS_36 (19.81)       FPS_37 (19.81)  FPS_38 (19.81)  FPS_39 (19.81)  
FPS_40 (19.81)  FPS_41 (19.81)  FPS_42 (19.81)  FPS_43 (19.81)  
FPS_44 (19.80)  FPS_45 (19.76)

gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk

Idx      W      C      C      %      %      %      %      %      %    MHz    MHz

0    235     56      -     61     43      1     31      0      0   9001   2520 
0    316     54      -     65     46      1     31      0      0   9001   2520 
0    259     52      -     65     45      1     31      0      0   9001   2520 
0    167     53      -     65     47      1     31      0      0   9001   2520 
0    172     54      -     61     43      1     34      0      0   9001   2520 
0    240     56      -     61     43      1     31      0      0   9001   2520 
0    317     53      -     66     45      1     30      0      0   9001   2520 
0    253     53      -     64     45      1     31      0      0   9001   2520 
0    162     53      -     64     46      1     31      0      0   9001   2520 
0    176     55      -     61     43      1     35      0      0   9001   2520 
0    240     56      -     66     47      1     34      0      0   9001   2520 
0    317     54      -     61     43      1     34      0      0   9001   2520 
0    252     52      -     61     44      1     34      0      0   9001   2520 
0    165     53      -     63     44      1     30      0      0   9001   2520 
0    180     55      -     67     47      0     27      0      0   9001   2520 
0    253     56      -     65     47      1     34      0      0   9001   2520 
0    315     54      -     61     43      1     34      0      0   9001   2520 
0    224     53      -     61     43      1     35      0      0   9001   2520 
0    158     53      -     63     44      1     30      0      0   9001   2520 
0    185     56      -     67     47      0     27      0      0   9001   2520 
0    263     56      -     63     47      1     34      0      0   9001   2520 
0    304     53      -     60     42      1     36      0      0   9001   2520 
0    203     53      -     61     41      1     34      0      0   9001   2520 
0    158     54      -     65     45      1     27      0      0   9001   2520 
0    194     57      -     66     46      0     29      0      0   9001   2520 




After adding a 47th stream:

gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk

Idx      W      C      C      %      %      %      %      %      %    MHz    MHz

0    253     57      -     65     47      1     34      0      0   9001   2520 
0    311     54      -     61     43      1     34      0      0   9001   2520 
0    226     53      -     61     43      1     34      0      0   9001   2520 
0    158     54      -     63     44      1     29      0      0   9001   2520 
0    191     57      -     67     47      0     27      0      0   9001   2520 
0    283     57      -     63     47      1     34      0      0   9001   2520 
0    299     54      -     61     43      1     34      0      0   9001   2520 
0    197     53      -     61     41      1     34      0      0   9001   2520 
0    161     54      -     66     46      1     28      0      0   9001   2520 
0    199     57      -     66     46      0     29      0      0   9001   2520 
0    291     57      -     63     49      1     35      0      0   9001   2520 
0    293     53      -     60     41      1     35      0      0   9001   2520 
0    195     53      -     61     41      1     34      0      0   9001   2520 
0    162     55      -     66     46      1     29      0      0   9001   2520 
0    205     57      -     66     47      0     31      0      0   9001   2520 
0    299     55      -     62     47      1     35      0      0   9001   2520 
0    283     53      -     61     43      1     34      0      0   9001   2520 
0    181     53      -     60     39      1     35      0      0   9001   2520 
0    169     55      -     67     47      1     28      0      0   9001   2520 
0    224     57      -     66     47      0     30      0      0   9001   2520 
0    313     55      -     61     43      1     34      0      0   9001   2520 
0    250     53      -     60     42      1     35      0      0   9001   2520 
0    144     50      -      0      0      0      0      0      0   9001   2520 
0     86     48      -      0      0      0      0      0      0   9001   2520 
0     85     47      -      0      0      0      0      0      0   9001   2520 
0     84     47      -      0      0      0      0      0      0   9001   2520 
0     84     46      -      0      0      0      0      0      0   9001   2520

Can we still improve this, or is this the limit for this case?

I meant that the dsserver_config has no SGIE in it, so if I want to add one or two, do I just do it the way we add SGIEs in a normal deepstream-app config?

Can I just add this in dsserver_config:

primary-gie:
  plugin-type: 1
  config-file-path: /workspace/configs/nvinferserver_yolo11s.txt
secondary-gie0:
  enable: 0
  plugin-type: 1
  config-file-path: dsserver_sgie_config.txt

...

secondary-gieN:
...
...

Or do I need to adjust the .cpp file?

I think I have told you the basic principle. You need to find the bottleneck according to your own implementation.

The deepstream-server-app does not use “deepstream-app” APIs. Please refer to the deepstream-test2 sample for how to add an SGIE to the pipeline.

All sample apps are open source. They are samples, not apps for end users.

Alright, I got this. But how do I check which part of the pipeline is the slowest? Is there something I can monitor for each of these parts?

  1. Have you guaranteed that the sources can provide 5120 frames per second?
  2. Have you broken down all your downstream components to make sure they can handle 5120 frames per second?

Ok, so for the first one:
I did manage to push around 45 stable streams before it collapsed when adding more.
So the total source rate should be 20 fps × 45 streams = 900 frames/s (way under 5120).

I don’t know how to check the second one. Any idea?

  1. Have you measured the speed of the postprocessing?
  2. It seems you have enabled the video encoder and file saving in your configuration, so you also need to check the encoding and file-save speed: Video Codec SDK | NVIDIA Developer

The deepstream-server app is open source; you can check which components are enabled by your configuration and try to break them down one by one.
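One way to do that breakdown is DeepStream’s built-in latency measurement (a sketch, assuming the API in nvds_latency_meta.h and the NVDS_ENABLE_LATENCY_MEASUREMENT=1 / NVDS_ENABLE_COMPONENT_LATENCY_MEASUREMENT=1 environment variables available in recent DeepStream releases); attach a buffer probe to the sink pad of the last element:

/* Sketch: per-frame latency accumulated across pipeline components.
 * Only yields data when NVDS_ENABLE_LATENCY_MEASUREMENT=1 is exported
 * (add NVDS_ENABLE_COMPONENT_LATENCY_MEASUREMENT=1 for per-plugin data). */
#include <gst/gst.h>
#include "nvds_latency_meta.h"

#define MAX_SOURCES 128   /* size to your maximum stream count */

static GstPadProbeReturn
latency_probe(GstPad *pad, GstPadProbeInfo *info, gpointer user_data)
{
    static NvDsFrameLatencyInfo latency[MAX_SOURCES];
    GstBuffer *buf = GST_PAD_PROBE_INFO_BUFFER(info);

    /* Returns the number of frames in this batch carrying latency data. */
    guint n = nvds_measure_buffer_latency(buf, latency);
    for (guint i = 0; i < n; i++)
        g_print("source %u frame %u latency %.2f ms\n",
                latency[i].source_id, latency[i].frame_num,
                latency[i].latency);
    return GST_PAD_PROBE_OK;
}

/* Attach it, e.g. on the sink element:
 *   GstPad *pad = gst_element_get_static_pad(sink, "sink");
 *   gst_pad_add_probe(pad, GST_PAD_PROBE_TYPE_BUFFER, latency_probe,
 *                     NULL, NULL);
 */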
  1. Yep. After adding some timing checks to the YOLO11 parser from NVIDIA (the link you gave me), I can see that the per-frame post-processing latency is around 0.1 ms.

    Some samples:

    YOLOv11 Post-Process Latency: 0.114689 ms
    YOLOv11 Post-Process Latency: 0.089131 ms
    YOLOv11 Post-Process Latency: 0.09148 ms
    YOLOv11 Post-Process Latency: 0.1009 ms
    YOLOv11 Post-Process Latency: 0.121041 ms
    YOLOv11 Post-Process Latency: 0.091336 ms
    YOLOv11 Post-Process Latency: 0.109893 ms

Afternoon,

I am here to say that I have now managed to add 128 streams dynamically and keep the pipeline running. I just needed to adjust the deepstream_server_app.cpp file, because that is where the pipeline is constructed.

@Fiona.Chen I do have another question; you can say whether it’s possible or not.

  1. Is it possible to dynamically add new models (SGIE2, etc.) to the pipeline while it’s running?
  2. For each stream, is it possible to customize the tracker and its config?
  3. For each stream, is it possible to have a different PGIE + dynamic crop + SGIE?

It depends on how you define “dynamically”.

If each stream has its own PGIE+SGIE, you don’t need to put them in the same pipeline; you can use multiple pipelines, as sketched below.
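A sketch of that idea, with one fully independent pipeline per stream built from a launch string (the URIs, config file names, and legacy-streammux width/height properties are placeholders to adapt):

/* Sketch: one independent pipeline per stream, each with its own
 * PGIE + SGIE chain. */
#include <gst/gst.h>

int main(int argc, char *argv[])
{
    gst_init(&argc, &argv);
    const char *uris[] = { "rtsp://cam1/stream", "rtsp://cam2/stream" };

    for (int i = 0; i < 2; ++i) {
        gchar *desc = g_strdup_printf(
            "uridecodebin uri=%s ! m.sink_0 "
            "nvstreammux name=m batch-size=1 width=1920 height=1080 ! "
            "nvinferserver config-file-path=pgie_%d.txt ! "
            "nvinferserver config-file-path=sgie_%d.txt ! fakesink",
            uris[i], i, i);
        GstElement *pipeline = gst_parse_launch(desc, NULL);
        g_free(desc);
        if (pipeline)
            gst_element_set_state(pipeline, GST_STATE_PLAYING);
    }
    g_main_loop_run(g_main_loop_new(NULL, FALSE));
    return 0;
}

Each pipeline can then load a different tracker and GIE configuration, which also covers the per-stream tracker and per-stream PGIE+SGIE questions above.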