So we need to know the model engine's performance first. Have you measured it with trtexec?
Can I use these results from when I first converted it from ONNX to TensorRT?
[11/20/2025-02:09:52] [I] === Performance summary ===
[11/20/2025-02:09:52] [I] Throughput: 67.0395 qps
[11/20/2025-02:09:52] [I] Latency: min = 24.5115 ms, max = 25.1259 ms, mean = 24.8803 ms, median = 24.8773 ms, percentile(90%) = 24.9915 ms, percentile(95%) = 25.0333 ms, percentile(99%) = 25.0992 ms
[11/20/2025-02:09:52] [I] Enqueue Time: min = 0.864746 ms, max = 1.53534 ms, mean = 0.908439 ms, median = 0.892609 ms, percentile(90%) = 0.942047 ms, percentile(95%) = 0.963501 ms, percentile(99%) = 1.15454 ms
[11/20/2025-02:09:52] [I] H2D Latency: min = 6.30591 ms, max = 6.34033 ms, mean = 6.31577 ms, median = 6.31271 ms, percentile(90%) = 6.32501 ms, percentile(95%) = 6.33289 ms, percentile(99%) = 6.33716 ms
[11/20/2025-02:09:52] [I] GPU Compute Time: min = 14.6442 ms, max = 15.0804 ms, mean = 14.843 ms, median = 14.8311 ms, percentile(90%) = 14.9565 ms, percentile(95%) = 14.9995 ms, percentile(99%) = 15.0723 ms
[11/20/2025-02:09:52] [I] D2H Latency: min = 3.43481 ms, max = 3.77759 ms, mean = 3.72158 ms, median = 3.72382 ms, percentile(90%) = 3.74524 ms, percentile(95%) = 3.75122 ms, percentile(99%) = 3.7627 ms
[11/20/2025-02:09:52] [I] Total Host Walltime: 3.04298 s
[11/20/2025-02:09:52] [I] Total GPU Compute Time: 3.02797 s
Yes, you can. Is the batch size 128?
Yep, my conversion command was:
trtexec --onnx=yolo11s.onnx --saveEngine=model.plan --minShapes=images:1x3x640x640 --optShapes=images:32x3x640x640 --maxShapes=images:128x3x640x640 --fp16
So the model engine seems quite fast. It can handle around 1000/25 × 128 = 5120 frames per second.
- According to Video Codec SDK | NVIDIA Developer, the L40S (similar to the L20) can decode at most 80 1080p@30fps H.264 video streams, which provides 30 × 80 = 2400 frames per second. If you have to use the hardware decoder in your case, it will be the bottleneck.
- Even if the sources can provide enough frames for inferencing, you also need to check the performance of your customized postprocessing function NvDsInferParseYolo() to make sure it can run more than 5120 times per second (see the sketch below).
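One way to measure that, as a minimal sketch: wrap the custom-parser entry point with std::chrono timers. This assumes the standard nvdsinfer custom-parser signature from nvdsinfer_custom_impl.h; decodeYoloTensors() is a hypothetical stand-in for the existing parsing code.

// Minimal timing sketch around the custom parser entry point.
// decodeYoloTensors() is a hypothetical placeholder for the real decoding logic.
#include <chrono>
#include <cstdio>
#include <vector>
#include "nvdsinfer_custom_impl.h"

// Hypothetical helper holding the actual tensor-decoding code.
static bool decodeYoloTensors(std::vector<NvDsInferLayerInfo> const&,
                              NvDsInferNetworkInfo const&,
                              NvDsInferParseDetectionParams const&,
                              std::vector<NvDsInferParseObjectInfo>&);

extern "C" bool NvDsInferParseYolo(
    std::vector<NvDsInferLayerInfo> const& outputLayersInfo,
    NvDsInferNetworkInfo const& networkInfo,
    NvDsInferParseDetectionParams const& detectionParams,
    std::vector<NvDsInferParseObjectInfo>& objectList)
{
    auto t0 = std::chrono::steady_clock::now();
    bool ok = decodeYoloTensors(outputLayersInfo, networkInfo,
                                detectionParams, objectList);
    auto t1 = std::chrono::steady_clock::now();
    // Per-call latency; the parser runs once per frame in the batch.
    std::printf("Post-Process Latency: %f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    return ok;
}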
Good morning, Fiona.
I tried your advice on the custom parser problem and switched to the Ultralytics-recognized custom YOLO parser instead (GitHub - marcoslucianops/DeepStream-Yolo: NVIDIA DeepStream SDK 8.0 / 7.1 / 7.0 / 6.4 / 6.3 / 6.2 / 6.1.1 / 6.1 / 6.0.1 / 6.0 / 5.1 implementation for YOLO models). The results do look better for 20fps streams: I can manage up to 45 streams before the pipeline dies, with better SM utilization. Since the custom parser is most probably the culprit, is there perhaps a way to squeeze more performance out of it? Perhaps by changing some of the config parameters?
I am using 20fps streams, so batch-push-timeout is 50000 (one frame interval: 1/20 s = 50000 µs), and the goal is a maximum of 64 streams. Then:
Streammux:
max-same-source-frames=2
max-num-frames-per-batch=64
num-surfaces-per-frame=1
buffer-pool-size=4
Should I change these in nvinferserver_yolo11s.txt:
extra {
copy_input_to_host_buffers: false
output_buffer_pool_size: 32
}
My full configs (just in case):
nvinferserver_yolo11s.txt (1.7 KB)
custom_nvstreammux_config.txt (661 Bytes)
dsserver_config.txt (1.7 KB)
46 streams:
gpu mem usage: ./deepstream-server-app 21816MiB
**PERF : FPS_0 (0.00) FPS_1 (18.55) FPS_2 (18.54) FPS_3 (18.54)
FPS_4 (18.54) FPS_5 (18.54) FPS_6 (18.54) FPS_7 (18.54)
FPS_8 (18.52) FPS_9 (18.52) FPS_10 (18.52) FPS_11 (18.52)
FPS_12 (18.52) FPS_13 (18.51) FPS_14 (18.51) FPS_15 (18.49)
FPS_16 (18.49) FPS_17 (18.49) FPS_18 (18.49) FPS_19 (18.48)
FPS_20 (18.04) FPS_21 (18.04) FPS_22 (18.04) FPS_23 (18.04)
FPS_24 (18.04) FPS_25 (17.99) FPS_26 (17.99) FPS_27 (17.98)
FPS_28 (17.98) FPS_29 (17.98) FPS_30 (17.53) FPS_31 (17.53)
FPS_32 (17.53) FPS_33 (17.53) FPS_34 (17.45) FPS_35 (17.45)
FPS_36 (17.44) FPS_37 (17.45) FPS_38 (17.44) FPS_39 (17.45)
FPS_40 (16.19) FPS_41 (16.19) FPS_42 (16.19) FPS_43 (15.03)
FPS_44 (14.34) FPS_45 (8.79)
gpu pwr gtemp mtemp sm mem enc dec jpg ofa mclk pclk
Idx W C C % % % % % % MHz MHz
0 301 57 - 62 42 1 30 0 0 9001 2520
0 292 53 - 65 48 1 29 0 0 9001 2520
0 214 53 - 62 45 1 31 0 0 9001 2520
0 156 54 - 63 47 1 30 0 0 9001 2520
0 193 57 - 60 43 1 35 0 0 9001 2520
0 254 57 - 61 44 1 32 0 0 9001 2520
0 308 55 - 63 47 1 30 0 0 9001 2520
0 269 53 - 65 46 1 29 0 0 9001 2520
0 169 54 - 64 49 1 29 0 0 9001 2520
0 175 54 - 61 45 1 31 0 0 9001 2520
0 216 57 - 60 43 1 33 0 0 9001 2520
0 292 57 - 63 43 1 30 0 0 9001 2520
0 307 54 - 63 45 1 30 0 0 9001 2520
0 237 53 - 64 45 1 30 0 0 9001 2520
0 94 48 - 0 0 0 0 0 0 9001 2520
0 86 47 - 0 0 0 0 0 0 9001 2520
0 85 47 - 0 0 0 0 0 0 9001 2520
0 84 46 - 0 0 0 0 0 0 9001 2520
For the algorithm's performance, you can consult the author of the algorithm on how to improve its efficiency; it is not provided by NVIDIA.
As general advice, we'd suggest implementing the postprocessing tensor-parsing algorithm with CUDA, just as we have done in deepstream_tools/yolo_deepstream/deepstream_yolo/config_infer_primary_yoloV11.txt at main · NVIDIA-AI-IOT/deepstream_tools
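To illustrate the idea only (this is not NVIDIA's actual implementation, and filterBoxes/launchFilter are hypothetical names): confidence filtering of the raw output tensor can run in a CUDA kernel, so only surviving boxes ever reach the CPU parser.

// Illustrative CUDA sketch of GPU-side confidence filtering.
#include <cuda_runtime.h>

// One thread per candidate box; survivors' indices are compacted via atomicAdd.
__global__ void filterBoxes(const float* scores, int numBoxes, float threshold,
                            int* keepIdx, int* keepCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numBoxes && scores[i] >= threshold) {
        keepIdx[atomicAdd(keepCount, 1)] = i;
    }
}

// Host-side launch on the inference stream; only keepCount boxes need a D2H copy.
void launchFilter(const float* dScores, int numBoxes, float threshold,
                  int* dKeepIdx, int* dKeepCount, cudaStream_t stream)
{
    cudaMemsetAsync(dKeepCount, 0, sizeof(int), stream);
    int block = 256;
    int grid = (numBoxes + block - 1) / block;
    filterBoxes<<<grid, block, 0, stream>>>(dScores, numBoxes, threshold,
                                            dKeepIdx, dKeepCount);
}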
By the way, @Fiona.Chen
Is it possible to dynamically add another model while the pipeline is running? (Kind of like dynamic sources, but adding/removing models instead.) Also, the deepstream-server-app does not support SGIE, does it? How do I add this option?
Can you provide more details of the scenario?
All the DeepStream samples demonstrate the usage of the DeepStream components and APIs. deepstream-server-app focuses on demonstrating the usage of the "nvmultiurisrcbin" APIs. It does not conflict with the usage of "PGIE+SGIE". The deepstream-test2 sample demonstrates how to construct "PGIE+SGIE" pipelines. You can construct an nvmultiurisrcbin+PGIE+SGIE pipeline according to the samples.
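As a rough sketch of that construction, following the deepstream-test2 pattern (the function, element names, and config paths here are illustrative assumptions, not code from the sample):

// Sketch: insert a PGIE + SGIE chain after nvmultiurisrcbin.
#include <gst/gst.h>

static void add_pgie_sgie(GstElement* pipeline, GstElement* srcbin,
                          GstElement* sink)
{
    // nvinferserver matches plugin-type 1 in the dsserver config;
    // deepstream-test2 itself uses nvinfer for both GIEs.
    GstElement* pgie = gst_element_factory_make("nvinferserver", "primary-infer");
    GstElement* sgie = gst_element_factory_make("nvinferserver", "secondary-infer");
    g_object_set(pgie, "config-file-path",
                 "/workspace/configs/nvinferserver_yolo11s.txt", NULL);
    g_object_set(sgie, "config-file-path", "configs/my_sgie.txt", NULL); // illustrative path

    gst_bin_add_many(GST_BIN(pipeline), pgie, sgie, NULL);
    // nvmultiurisrcbin already outputs batched frames, so link straight through.
    gst_element_link_many(srcbin, pgie, sgie, sink, NULL);
}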
What option?
And I tested it, thanks @Fiona.Chen!
The results look a bit better, as I can add one more stream before it collapses:
./deepstream-server-app 21792MiB
**PERF : FPS_0 (0.00) FPS_1 (19.81) FPS_2 (19.81) FPS_3 (19.81)
FPS_4 (19.81) FPS_5 (19.81) FPS_6 (19.81) FPS_7 (19.81)
FPS_8 (19.81) FPS_9 (19.81) FPS_10 (19.81) FPS_11 (19.81)
FPS_12 (19.81) FPS_13 (19.81) FPS_14 (19.81) FPS_15 (19.81)
FPS_16 (19.81) FPS_17 (19.81) FPS_18 (19.81) FPS_19 (19.80)
FPS_20 (19.80) FPS_21 (19.80) FPS_22 (19.80) FPS_23 (19.80)
FPS_24 (19.81) FPS_25 (19.81) FPS_26 (19.81) FPS_27 (19.81)
FPS_28 (19.81) FPS_29 (19.81) FPS_30 (19.81) FPS_31 (19.81)
FPS_32 (19.81) FPS_33 (19.81) FPS_34 (19.80) FPS_35 (19.80)
FPS_36 (19.81) FPS_37 (19.81) FPS_38 (19.81) FPS_39 (19.81)
FPS_40 (19.81) FPS_41 (19.81) FPS_42 (19.81) FPS_43 (19.81)
FPS_44 (19.80) FPS_45 (19.76)
gpu pwr gtemp mtemp sm mem enc dec jpg ofa mclk pclk
Idx W C C % % % % % % MHz MHz
0 235 56 - 61 43 1 31 0 0 9001 2520
0 316 54 - 65 46 1 31 0 0 9001 2520
0 259 52 - 65 45 1 31 0 0 9001 2520
0 167 53 - 65 47 1 31 0 0 9001 2520
0 172 54 - 61 43 1 34 0 0 9001 2520
0 240 56 - 61 43 1 31 0 0 9001 2520
0 317 53 - 66 45 1 30 0 0 9001 2520
0 253 53 - 64 45 1 31 0 0 9001 2520
0 162 53 - 64 46 1 31 0 0 9001 2520
0 176 55 - 61 43 1 35 0 0 9001 2520
0 240 56 - 66 47 1 34 0 0 9001 2520
0 317 54 - 61 43 1 34 0 0 9001 2520
0 252 52 - 61 44 1 34 0 0 9001 2520
0 165 53 - 63 44 1 30 0 0 9001 2520
0 180 55 - 67 47 0 27 0 0 9001 2520
0 253 56 - 65 47 1 34 0 0 9001 2520
0 315 54 - 61 43 1 34 0 0 9001 2520
0 224 53 - 61 43 1 35 0 0 9001 2520
0 158 53 - 63 44 1 30 0 0 9001 2520
0 185 56 - 67 47 0 27 0 0 9001 2520
0 263 56 - 63 47 1 34 0 0 9001 2520
0 304 53 - 60 42 1 36 0 0 9001 2520
0 203 53 - 61 41 1 34 0 0 9001 2520
0 158 54 - 65 45 1 27 0 0 9001 2520
0 194 57 - 66 46 0 29 0 0 9001 2520
adding the 47th stream:
gpu pwr gtemp mtemp sm mem enc dec jpg ofa mclk pclk
Idx W C C % % % % % % MHz MHz
0 253 57 - 65 47 1 34 0 0 9001 2520
0 311 54 - 61 43 1 34 0 0 9001 2520
0 226 53 - 61 43 1 34 0 0 9001 2520
0 158 54 - 63 44 1 29 0 0 9001 2520
0 191 57 - 67 47 0 27 0 0 9001 2520
0 283 57 - 63 47 1 34 0 0 9001 2520
0 299 54 - 61 43 1 34 0 0 9001 2520
0 197 53 - 61 41 1 34 0 0 9001 2520
0 161 54 - 66 46 1 28 0 0 9001 2520
0 199 57 - 66 46 0 29 0 0 9001 2520
0 291 57 - 63 49 1 35 0 0 9001 2520
0 293 53 - 60 41 1 35 0 0 9001 2520
0 195 53 - 61 41 1 34 0 0 9001 2520
0 162 55 - 66 46 1 29 0 0 9001 2520
0 205 57 - 66 47 0 31 0 0 9001 2520
0 299 55 - 62 47 1 35 0 0 9001 2520
0 283 53 - 61 43 1 34 0 0 9001 2520
0 181 53 - 60 39 1 35 0 0 9001 2520
0 169 55 - 67 47 1 28 0 0 9001 2520
0 224 57 - 66 47 0 30 0 0 9001 2520
0 313 55 - 61 43 1 34 0 0 9001 2520
0 250 53 - 60 42 1 35 0 0 9001 2520
0 144 50 - 0 0 0 0 0 0 9001 2520
0 86 48 - 0 0 0 0 0 0 9001 2520
0 85 47 - 0 0 0 0 0 0 9001 2520
0 84 47 - 0 0 0 0 0 0 9001 2520
0 84 46 - 0 0 0 0 0 0 9001 2520
Can we still push this further, or is this the limit for this case?
I meant that dsserver_config has no SGIE in it, so if I want to add one or two, do I just do it the way we add SGIEs in a normal deepstream-app config?
Can I just add this to dsserver_config:
primary-gie:
  plugin-type: 1
  config-file-path: /workspace/configs/nvinferserver_yolo11s.txt
secondary-gie0:
  enable: 0
  plugin-type: 1
  config-file-path: dsserver_sgie_config.txt
  ...
secondary-gieN:
  ...
...
Or do I need to adjust the .cpp file?
I think I have told you the basic principle. You need to find the bottleneck in your own implementation.
The deepstream-server-app does not use the "deepstream-app" APIs. Please refer to the deepstream-test2 sample for how to add an SGIE to the pipeline.
All sample apps are open source. They are samples, not apps for end users.
Alright, I got it. But how do I check which part of the pipeline is the slowest? Is there something I can monitor for each of these parts?
- Have you guaranteed that the sources can provide 5120 frames per second?
- Have you broken down all your downstream components to make sure they can handle 5120 frames per second?
OK, so for the first one:
I did manage to push around 45 stable streams before it collapsed when adding more.
So the total source rate should be 20 fps × 45 streams = 900 frames/s (way under 5120).
I don’t know how to check the second one. Any idea?
1. Have you measured the speed of the postprocessing?
2. It seems you have enabled the video encoder and file saving in your configuration, so you also need to check the encoding and file-save speed: Video Codec SDK | NVIDIA Developer
3. deepstream-server is open source; you can check which components are enabled by your configuration and try to break them down one by one (e.g. with a pad probe, as sketched below).
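A minimal sketch of such a per-component breakdown, assuming a GStreamer pad probe (monitor_element() and the counter plumbing are illustrative, not DeepStream APIs):

// Count buffers leaving an element's src pad and print the rate once per
// second, to see which stage stops keeping up.
#include <gst/gst.h>

static GstPadProbeReturn count_cb(GstPad* pad, GstPadProbeInfo* info,
                                  gpointer user_data)
{
    g_atomic_int_inc(static_cast<gint*>(user_data));
    return GST_PAD_PROBE_OK;
}

static gboolean report_cb(gpointer user_data)
{
    gint* counter = static_cast<gint*>(user_data);
    g_print("buffers/sec: %d\n", g_atomic_int_get(counter));
    g_atomic_int_set(counter, 0);
    return G_SOURCE_CONTINUE;
}

// Attach to any element you want to inspect (e.g. the PGIE or the encoder).
// NOTE: one static counter is shared here; use per-element state in real code.
void monitor_element(GstElement* elem)
{
    static gint counter = 0;
    GstPad* srcpad = gst_element_get_static_pad(elem, "src");
    gst_pad_add_probe(srcpad, GST_PAD_PROBE_TYPE_BUFFER, count_cb, &counter, NULL);
    gst_object_unref(srcpad);
    g_timeout_add_seconds(1, report_cb, &counter);
}

GStreamer's built-in latency tracer (running the app with GST_TRACERS=latency and GST_DEBUG=GST_TRACER:7) is another option for per-element latency.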
Yep. After adding some time checking in the YOLO11 parser from NVIDIA (the link you gave me), I can see that per-frame latency for post-processing is around 0.1 ms.
Some samples:
YOLOv11 Post-Process Latency: 0.114689 ms
YOLOv11 Post-Process Latency: 0.089131 ms
YOLOv11 Post-Process Latency: 0.09148 ms
YOLOv11 Post-Process Latency: 0.1009 ms
YOLOv11 Post-Process Latency: 0.121041 ms
YOLOv11 Post-Process Latency: 0.091336 ms
YOLOv11 Post-Process Latency: 0.109893 ms
Afternoon,
I am here to say that I now managed to add 128 streams dynamically and maintain the pipeline. I just needed to adjust the deepstream_server_app.cpp file, because that is apparently where the pipeline is built.
@Fiona.Chen I do have another question; you can say whether it's possible or not.
- Is it possible to dynamically add new models (SGIE2, etc.) to the pipeline while it's running?
- For each stream, is it possible to customize the tracker and its config?
- For each stream, is it possible to have different PGIE + dynamic crop + SGIE?
It depends on how you define "dynamically".
If each stream has its own PGIE+SGIE, you don't need to put them in the same pipeline; you can use multiple pipelines. For example:
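A rough sketch of one pipeline per stream (the launch string, caps, and config paths are illustrative assumptions, not sample code):

// Build an independent pipeline per stream, each with its own PGIE/SGIE configs.
#include <gst/gst.h>

GstElement* make_pipeline_for_stream(const char* uri, const char* pgieCfg,
                                     const char* sgieCfg)
{
    gchar* desc = g_strdup_printf(
        "nvstreammux name=mux batch-size=1 width=1280 height=720 ! "
        "nvinfer config-file-path=%s ! nvinfer config-file-path=%s ! fakesink "
        "uridecodebin uri=%s ! mux.sink_0",
        pgieCfg, sgieCfg, uri);
    GstElement* pipeline = gst_parse_launch(desc, NULL);
    g_free(desc);
    return pipeline;  // each pipeline is set to PLAYING independently of the others
}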