Deepstream-server FPS Keeps Falling After A While

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU): dGPU
• DeepStream Version: 8.0
• JetPack Version (valid for Jetson only)
• TensorRT Version: 10.9.0.34
• NVIDIA GPU Driver Version (valid for GPU only): L20
• Issue Type( questions, new requirements, bugs): questions

Hello NVIDIA team, I want to ask about a problem I have when using the deepstream-server sample application for my pipeline. I am very new to all of this, so your guidance will help a lot.

Currently, I am trying to create a pipeline where:

  1. I can add/delete models dynamically without restarting the server,
  2. I can add/delete streams without stopping and restarting the server.

Using the files here: https://drive.google.com/drive/folders/1llRT7l-UgnPArC1AoFgRmEJhL9M8CEap?usp=sharing

I basically do:

Restart docker compose

docker compose down && docker compose up -d

Terminal 1: Triton Server

cd /root/saas-triton-deepstream

docker compose exec deepstream bash

cd /workspace/model_repository

tritonserver --model-repository=/workspace/model_repository \
  --model-control-mode=poll \
  --repository-poll-secs 30 \
  --pinned-memory-pool-byte-size 1073741824 \
  --cuda-memory-pool-byte-size 0:4294967296


Terminal 3: Run Stream

source /root/samurai-copilot-saas/testenv/bin/activate

python /root/saas-triton-deepstream/scripts/stream_publisher.py /root/saas-triton-deepstream/test-media/sample_1080p_h264_20fps.mp4 -n 32


Terminal 2: DeepStream with REST API

cd /root/saas-triton-deepstream

docker compose exec deepstream bash

export GST_PLUGIN_PATH=/opt/nvidia/deepstream/deepstream-8.0/lib/gst-plugins:$GST_PLUGIN_PATH

export LD_LIBRARY_PATH=/workspace/lib:/opt/nvidia/deepstream/deepstream-8.0/lib:$LD_LIBRARY_PATH

/workspace/deepstream-app-server/start_deepstream_server.sh

I tested this by adding the streams one by one to check where the bottleneck occurs. It seems that once the stream count reaches around 19, things start to degrade.
Example with 16 streams:
**PERF : FPS_0 (0.00) FPS_1 (19.90) FPS_2 (19.90) FPS_3 (19.89) FPS_4 (19.88) FPS_5 (19.88) FPS_6 (19.89) FPS_7 (19.89) FPS_8 (19.88) FPS_9 (19.84) FPS_10 (19.86) FPS_11 (19.88) FPS_12 (19.88) FPS_13 (19.88) FPS_14 (19.82) FPS_15 (19.82)

Results with 21 streams:
**PERF : FPS_0 (0.00) FPS_1 (19.60) FPS_2 (19.60) FPS_3 (19.60) FPS_4 (19.42) FPS_5 (19.42) FPS_6 (19.42) FPS_7 (19.40) FPS_8 (19.02) FPS_9 (18.95) FPS_10 (18.95) FPS_11 (18.87) FPS_12 (18.87) FPS_13 (18.87) FPS_14 (18.77) FPS_15 (18.77) FPS_16 (18.69) FPS_17 (15.51) FPS_18 (15.50) FPS_19 (16.62) FPS_20 (16.65)

nvidia-smi dmon output:

gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk
Idx      W      C      C      %      %      %      %      %      %    MHz    MHz
0    133     45      -     15      8      0      8      0      0   9001   2520
0    122     47      -     26     15      0     15      0      0   9001   2520 
0    136     47      -     21     10      0      7      0      0   9001   2520 
0    151     48      -     26     15      0     14      0      0   9001   2520 
0    125     46      -     15      8      0      7      0      0   9001   2520 
0    130     45      -     26     15      0     15      0      0   9001   2520 
0    133     48      -     25     15      0     15      0      0   9001   2520 
0    123     46      -     17      9      0      8      0      0   9001   2520 
0    132     46      -     24     14      0     11      0      0   9001   2520 
0    140     45      -     17     12      0     15      0      0   9001   2520 
0    121     46      -     24     14      0      9      0      0   9001   2520 
0    135     45      -     26     15      0     14      0      0   9001   2520 
0    150     46      -     16      9      0      7      0      0   9001   2520 
0    129     47      -     21     11      0      8      0      0   9001   2520 
0    133     47      -     24     14      0     10      0      0   9001   2520 
0    145     45      -     26     15      0     15      0      0   9001   2520 
0    116     46      -     15      8      0      7      0      0   9001   2520 
0    135     46      -     26     15      0     15      0      0   9001   2520 
0    126     46      -     26     15      0     15      0      0   9001   2520 
0    131     46      -     26     15      0     15      0      0   9001   2520 
0    133     45      -     25     15      0     15      0      0   9001   2520 
0    133     46      -     16     10      0     14      0      0   9001   2520 
0    122     46      -     24     15      0     15      0      0   9001   2520 
0    134     46      -     24     15      0     15      0      0   9001   2520 
0    137     45      -     26     16      0     15      0      0   9001   2520 
0    128     47      -     14      8      0      9      0      0   9001   2520 
0    140     46      -     16      8      0      9      0      0   9001   2520 
0    138     45      -     24     14      0      8      0      0   9001   2520 
0    122     46      -     15      9      0     14      0      0   9001   2520 

I have also already tried the formula from DeepStream tune fps, as well as suggestions from several AI platforms.

Now I'm stuck, because I don't know what else I should do to improve GPU utilization and overall performance. My goal is to run at least 20 streams simultaneously with stable FPS output.

FYI, I'm using an ECS instance from Alibaba Cloud for this, running everything through VS Code over SSH because I don't know where else I should do it. Any help is greatly appreciated!!

There is only one RTSP stream generated from the ffmpeg RTSP stream pushing script. Do you mean you add the same RTSP stream multiple times to simulate multiple RTSP streams?

The bandwidth of the TCP/UDP port may be a limitation in such a case.

  1. In your “nvinferserver_yolo11s.txt”, I saw that you are using the Triton gRPC interface. The gRPC bandwidth will be another limitation too. The suggestion is to use capi instead of the gRPC API if you have to use Triton for some unavoidable reason (see the config sketch after this list).
  2. You need to measure the speed of your customized postprocessing function NvDsInferParseYolo() to make sure it will not be a bottleneck when the stream count becomes larger. It is a callback that will be invoked for every frame in the batch.
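For reference, the switch from gRPC to capi is mainly a config change in the nvinferserver file. A hedged sketch of the two backend variants (field names follow the nvinferserver protobuf schema; the model name, URL, and repository path here are illustrative):

# gRPC mode -- nvinferserver talks to a separately running tritonserver:
infer_config {
  backend {
    triton {
      model_name: "yolo11s"
      version: -1
      grpc {
        url: "localhost:8001"
      }
    }
  }
}

# capi mode -- nvinferserver embeds Triton in-process and loads the
# repository directly, so no network hop is involved:
infer_config {
  backend {
    triton {
      model_name: "yolo11s"
      version: -1
      model_repo {
        root: "/workspace/model_repository"
        strict_model_config: true
      }
    }
  }
}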

Thank you for the fast reply!

For the streams, I use the same video file to generate 32 streams, each published at “localhost:8554/stream_n”, with n being the stream index.
python /scripts/stream_publisher.py /test-media/sample_1080p_h264_20fps.mp4 -n 32

I add the stream to the pipeline by using this line:

python3 rest_api_client.py add \
  --id cam001 \
  --name "Front Door" \
  --url rtsp://mediamtx:8554/stream1


then add another:
python3 rest_api_client.py add \
  --id cam002 \
  --name "Front Door" \
  --url rtsp://mediamtx:8554/stream2

...

until I reach id cam031.
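
(In case it helps anyone reproducing this: rest_api_client.py presumably just wraps the server's REST endpoint, which the logs below show listening on port 9000 at /api/v1/stream/add. A roughly equivalent raw request, with the payload shape taken from the DeepStream REST API documentation and all values illustrative, would be:)

curl -X POST http://localhost:9000/api/v1/stream/add \
  -H "Content-Type: application/json" \
  -d '{
        "key": "sensor",
        "value": {
          "camera_id": "cam001",
          "camera_name": "Front Door",
          "camera_url": "rtsp://mediamtx:8554/stream1",
          "change": "camera_add"
        }
      }'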

In your “nvinferserver_yolo11s.txt”, I saw that you are using the Triton gRPC interface. The gRPC bandwidth will be another limitation too. The suggestion is to use capi instead of the gRPC API if you have to use Triton for some unavoidable reason.

I actually referred to your sample DeepStream configs, partly to separate Triton and DeepStream, because I want to manage my model repository dynamically as well. Do you have a guide for this capi for Triton? I do need to use Triton in this situation.

You need to measure the speed of your customized postprocessing function NvDsInferParseYolo() to make sure it will not be a bottleneck when the stream count becomes larger. It is a callback that will be invoked for every frame in the batch.

I see. Is it possible to measure this within DeepStream itself?

Thank you for the help!!

The capi samples:
/opt/nvidia/deepstream/deepstream/samples/configs/deepstream-app-triton

Please read the README file under this folder first.

No. You can add some code to calculate the time.
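
For example, a minimal sketch with std::chrono (this assumes you rename your existing parser body to a hypothetical NvDsInferParseYoloImpl and wrap it; the static counters are not thread-safe, but are fine for a quick measurement):

#include <chrono>
#include <cstdio>
#include "nvdsinfer_custom_impl.h"

/* Hypothetical: the original parser body, renamed. */
static bool NvDsInferParseYoloImpl(
    std::vector<NvDsInferLayerInfo> const &outputLayersInfo,
    NvDsInferNetworkInfo const &networkInfo,
    NvDsInferParseDetectionParams const &detectionParams,
    std::vector<NvDsInferParseObjectInfo> &objectList);

extern "C" bool NvDsInferParseYolo(
    std::vector<NvDsInferLayerInfo> const &outputLayersInfo,
    NvDsInferNetworkInfo const &networkInfo,
    NvDsInferParseDetectionParams const &detectionParams,
    std::vector<NvDsInferParseObjectInfo> &objectList)
{
  using clock = std::chrono::steady_clock;
  static double totalUs = 0.0; /* accumulated parse time, microseconds */
  static long calls = 0;       /* frames parsed so far */

  auto t0 = clock::now();
  bool ok = NvDsInferParseYoloImpl(outputLayersInfo, networkInfo,
                                   detectionParams, objectList);
  auto t1 = clock::now();

  totalUs += std::chrono::duration<double, std::micro>(t1 - t0).count();
  if (++calls % 1000 == 0)
    std::printf("NvDsInferParseYolo avg: %.1f us over %ld calls\n",
                totalUs / calls, calls);
  return ok;
}

Remember the callback runs once per frame in the batch, so compare the average against your per-frame budget.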

You can refer to On the Fly Model Update — DeepStream documentation to check whether it can meet your requirement.


Alright, thanks Fiona!

I will try this first and report back if I still have problems.

One more thing: when I run this with codec=2 in dsserver_config.yml and the encoder enabled, I get this error:

Calling gst_element_factory_make for nvmultiurisrcbin
Using file: /workspace/configs/dsserver_config.yml
Running...
ERROR from element file-sink: No file name specified for writing.
Error details: ../plugins/elements/gstfilesink.c(501): gst_file_sink_open_file (): /GstPipeline:dsserver-pipeline/GstFileSink:file-sink
Returned, stopping playback
Deleting pipeline


Meanwhile, with codec=1 the app runs just fine. I didn't change anything else other than that.

What could possibly be the problem?

Another question that I have:

If I am using capi instead, that means the model repo is specified in the nvinferserver config file. Does this mean that whenever I run the deepstream-server, Triton will start, and when I close it, Triton will shut down alongside DeepStream?

What if I make some changes to the model's config.pbtxt file, for example changing the version? Will Triton pick it up and update the model, or do I need to re-run the deepstream-server to make use of the change?

What do you mean by “close”?

It is a little bug in /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-server/deepstream_server_app.cpp. At line 453, you can add one line:
g_object_set (G_OBJECT (appctx.sink), "location", "out.hevc", NULL);
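
For context, a sketch of how that line sits in the app (the surrounding code here is an assumption; check your copy near line 453). The point is simply that the filesink needs a “location” set before the pipeline goes to PLAYING, which the codec=2 path was missing:

/* Hypothetical surroundings -- only the g_object_set line is the actual fix. */
appctx.sink = gst_element_factory_make ("filesink", "file-sink");
if (!appctx.sink) {
  g_printerr ("file-sink could not be created. Exiting.\n");
  return -1;
}
/* The added line: give the sink an output path. */
g_object_set (G_OBJECT (appctx.sink), "location", "out.hevc", NULL);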

Sorry, I have a limited understanding of this topic as well. I mean “close” as in when I stop the deepstream-server with the Ctrl+C combo.

That means the whole process has ended. In the capi case, Triton is closed as well. You can change the model before you start the deepstream-server process again.

It also works this way if you use nvinfer in such a case: when the whole process is down, you can replace the model and the processing.

I see. That's because in your DeepStream-Triton with gRPC setup, the Triton Inference Server runs in a separate deepstream-triton container, so it can dynamically check the model repo whether my deepstream-app is closed, not running, or actively running.

Since I want to add multiple arbitrary models like YOLO and rfdetr-small, I don't think the On the Fly Model Update — DeepStream documentation can help in this situation.

I have suddenly run into another problem.

I just did docker compose down && docker compose up -d, and now when I go back to where I run DeepStream (inside the container), this time without gRPC (using capi instead), I keep getting this error:

Context: trying to add a new stream (RTSP) using the same method as before.

Calling gst_element_factory_make for nvmultiurisrcbin
Using file: /workspace/configs/dsserver_config.yml
Opening in BLOCKING MODE
Civetweb version: v1.16
Server running at port: 9000
0:00:00.144610485 55609 0x5576efcfd570 WARN           nvinferserver gstnvinferserver_impl.cpp:365:validatePluginConfig: warning: Configuration file batch-size reset to: 32
INFO: TrtISBackend id:1 initialized model: yolo11s
Running…

WARNING from element primary-nvinference-engine: Configuration file batch-size reset to: 32
Warning: Configuration file batch-size reset to: 32

uri:/api/v1/stream/add
method:POST
Protocol set: 0x7
FATAL 56: uncaught error
PANIC 56: uncaught error (calling abort)
/workspace/deepstream-app-server/start_deepstream_server.sh: line 109: 55609 Aborted                 (core dumped) ./deepstream-server-app "$CONFIG_FILE"

I remember I ran into a similar problem before; back then I edited the rest_server_callbacks.cpp file and it worked. But doing the same thing doesn't help me now. What should I do?

Also, the fix for H.265 by adding that new line doesn't work for me either.

Wait, never mind, I got it.

Morning, @Fiona.Chen

I did what you recommended yesterday and managed to improve the FPS report with 32 streams.
**PERF : FPS_0 (0.00) FPS_1 (19.75) FPS_2 (19.75) FPS_3 (19.74) FPS_4 (19.74) FPS_5 (19.74) FPS_6 (19.74) FPS_7 (19.74) FPS_8 (19.74) FPS_9 (19.74) FPS_10 (19.73) FPS_11 (19.73) FPS_12 (19.73) FPS_13 (19.73) FPS_14 (19.73) FPS_15 (19.73) FPS_16 (19.73) FPS_17 (19.72) FPS_18 (19.72) FPS_19 (19.72) FPS_20 (19.67) FPS_21 (19.50) FPS_22 (19.50) FPS_23 (19.34) FPS_24 (19.34) FPS_25 (19.33) FPS_26 (19.31) FPS_27 (19.27) FPS_28 (19.27) FPS_29 (18.86) FPS_30 (18.77) FPS_31 (18.77)

gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk
Idx      W      C      C      %      %      %      %      %      %    MHz    MHz
0    199     52      -     41     28      0     25      0      0   9001   2520
0    150     51      -     41     28      0     25      0      0   9001   2520
0    190     51      -     41     27      0     24      0      0   9001   2520 
0    191     54      -     40     28      0     25      0      0   9001   2520 
0    189     53      -     41     28      0     25      0      0   9001   2520 
0    207     50      -     36     25      0     24      0      0   9001   2520 
0    181     50      -     35     24      0     23      0      0   9001   2520 
0    180     51      -     36     26      0     25      0      0   9001   2520 
0    155     53      -     41     28      0     25      0      0   9001   2520 
0    186     54      -     41     28      0     25      0      0   9001   2520 
0    185     51      -     40     27      0     24      0      0   9001   2520 
0    184     51      -     38     25      0     21      0      0   9001   2520 
0    177     51      -     37     27      0     25      0      0   9001   2520 
0    185     50      -     38     28      0     24      0      0   9001   2520 
0    173     51      -     37     24      0     17      0      0   9001   2520 
0    184     53      -     41     28      0     25      0      0   9001   2520 
0    174     51      -     34     22      0     21      0      0   9001   2520 
0    181     53      -     41     28      0     24      0      0   9001   2520 
0    185     54      -     41     28      0     25      0      0   9001   2520 
0    195     50      -     35     25      0     24      0      0   9001   2520 
0    172     51      -     39     27      0     19      0      0   9001   2520 
0    173     51      -     39     26      0     21      0      0   9001   2520 
0    146     53      -     41     28      0     25      0      0   9001   2520 
0    183     53      -     41     28      0     25      0      0   9001   2520 
0    192     51      -     41     28      0     25      0      0   9001   2520 
0    185     51      -     36     24      0     25      0      0   9001   2520 
0    164     50      -     36     26      0     25      0      0   9001   2520 
0    161     55      -     42     28      0     25      0      0   9001   2520 
0    178     55      -     41     28      0     25      0      0   9001   2520 
0    185     55      -     41     28      0     25      0      0   9001   2520 
0    196     54      -     41     28      0     25      0      0   9001   2520 
0    183     55      -     41     28      0     25      0      0   9001   2520 
0    205     53      -     41     28      0     25      0      0   9001   2520 
0    198     51      -     41     28      0     25      0      0   9001   2520 
0    154     53      -     36     22      0     21      0      0   9001   2520 
0    187     53      -     41     28      0     24      0      0   9001   2520 
0    185     52      -     41     28      0     25      0      0   9001   2520 
0    176     51      -     41     28      0     24      0      0   9001   2520 
0    169     51      -     41     28      0     25      0      0   9001   2520 
0    171     51      -     41     28      0     24      0      0   9001   2520 
0    187     51      -     41     28      0     25      0      0   9001   2520 
0    183     51      -     39     26      0     19      0      0   9001   2520 
0    166     54      -     40     28      0     24      0      0   9001   2520 
0    192     51      -     38     25      0     19      0      0   9001   2520 

By the way, is there any method I can use, or anything else I can do, to increase the SM percentage?

For the pipeline, the whole performance is decided by the slowest component in the pipeline. If you want the GPU to be used as much as possible, you need to make sure source frames are fed to the model as fast as possible. I don't have enough information about every part of your pipeline, but I have listed some possible points (network bandwidth, postprocessing, gRPC, …) in the previous posts. Please break down every part of your pipeline to find its bottleneck first.


You also need to measure your batch-128 TensorRT engine's performance to check the model's own limitation.


Alright, thank you so much for your help, @Fiona.Chen !!

I just remembered one more thing: I saw that with 44 15-fps RTSP streams, my decoder only reaches 34%. Is this already the maximum, or is it also the result of some problem in my pipeline? Could this be a bottleneck due to the GPU type (L20)?

gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk
Idx      W      C      C      %      %      %      %      %      %    MHz    MHz
...

0    158     54      -     45     30      0     22      0      0   9001   2520
0    173     52      -     51     36      0     28      0      0   9001   2520
0    194     51      -     52     36      0     32      0      0   9001   2520
0    221     51      -     45     33      0     34      0      0   9001   2520
0    196     51      -     38     29      0     30      0      0   9001   2520
0    168     55      -     29     20      0     18      0      0   9001   2520

...

I am currently just testing how many streams I can add before the pipeline collapses.

Please use “trtexec” to test your model engine’s performance.
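
For example (the engine file name and input tensor name are assumptions; substitute your own, and --shapes is only needed for dynamic-shape engines):

trtexec --loadEngine=yolo11s_b128.engine \
  --shapes=images:128x3x640x640 \
  --warmUp=500 --iterations=100

trtexec reports throughput in queries per second, where one query is one whole batch, so multiply by the batch size (128 here) to get frames per second.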

This means that your model is not the pipeline's bottleneck. Either the upstream can't feed tensor data to the model fast enough, or the downstream can't consume the output tensors quickly enough. You need to find out which component is the bottleneck.

I overcame this by reducing batch-push-timeout to 1/5 of the recommended value, far lower than what the official guide suggests (1/max-fps). I also turned off sync and adaptive batching (sketched below). So far I have managed to get 41 streams at 20 fps, 55 streams at 15 fps, and 75 streams at 10 fps. However, I notice that in all of these tests, the SM% and dec% never hit 100% before the pipeline crashes and I become unable to add more streams. In every case they reach about 56% and 32% respectively, then immediately drop to 0.
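
To illustrate with the 20-fps case: the recommended batched-push-timeout would be 1/20 s = 50000 us, so I use roughly 10000 us. Sketched as settings (property names here follow the nvstreammux plugin and are an assumption; the exact keys in dsserver_config.yml may differ):

streammux:
  batched-push-timeout: 10000   # us; 1/5 of 1/max-fps (1/20 s = 50000 us)
  sync-inputs: 0                # sync turned off
  # adaptive batching disabled as well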

Any advice?