Scaling problem using Triton server and RTSP Multi-stream

• GPU RTX 3060
• DeepStream Version 6.4
• NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2

I’m running multi-stream RTSP tests with a separate Triton server and gRPC.

When I increase the number of streams to 4 (using the deepstream-rtsp-in-rtsp-out sample), I can’t get real-time results. The time displayed on the RTSP output lags behind the time displayed on the original camera stream. Here’s an image of the result:

I’ve noticed that the delays are growing somewhat randomly. There can be a delay of up to 4-5 minutes.

None of my hardware components seem to be overloaded (CPU OK, RAM OK, GPU 2 GB used out of the 6 GB available).

Could you explain why this scaling problem occurs?

Here are the steps to reproduce:

  1. Download the folder with the model files:
    triton_model_repo.zip (7.8 MB)

  2. Create the network:

docker network create ds-network

  3. Start the Triton server (update the triton_model_repo volume path) in CONTAINER 1:

docker run --gpus '"device=0"' -it --rm -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=$DISPLAY -v ~/Downloads/triton_model_repo:/models --net=ds-network --hostname=triton-server nvcr.io/nvidia/deepstream:6.4-gc-triton-devel tritonserver --model-repository=/models

  4. Start CONTAINER 2 and prepare the environment:

xhost +
docker run --gpus all -it --rm --privileged -v /tmp/.X11-unix:/tmp/.X11-unix -v ~/Downloads/triton_model_repo:/models -e DISPLAY=$DISPLAY -w /opt/nvidia/deepstream/deepstream-6.4 --net=ds-network nvcr.io/nvidia/deepstream:6.4-gc-triton-devel
./user_deepstream_python_apps_install.sh --build-bindings -r master && cp /models/people_nvidia_detector/dstest1_pgie_inferserver_config.txt /models/people_nvidia_detector/labels.txt sources/deepstream_python_apps/apps/deepstream-rtsp-in-rtsp-out/ && cd sources/deepstream_python_apps/apps/deepstream-rtsp-in-rtsp-out/

  5. Run the sample in CONTAINER 2 with multiple RTSP streams:

python3 deepstream_test1_rtsp_in_rtsp_out.py -i rtsp://1 rtsp://2 rtsp://3 rtsp://4 -g nvinferserver

It may be affected by the following aspects.

  1. Please check the configuration of this value in deepstream_test1_rtsp_in_rtsp_out.py.
  2. Please set the idrinterval parameter on the encoder. Currently the parameter is only set on Jetson; you can modify that and set idrinterval to 30 (deepstream_test1_rtsp_in_rtsp_out.py#L255).
  3. You can also try the new nvstreammux to get more precise control over how long the muxer waits for data when forming a batch. You can refer to our open-source sample samples\configs\deepstream-app\config_mux_source4.txt to see how to configure that (see the sketch below).
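
For illustration, a minimal sketch of what point 3 could look like in the Python app. This assumes the USE_NEW_NVSTREAMMUX environment variable and the config-file-path property of the new mux; the config path points at the sample file mentioned above and may need adjusting inside the container:

    # Sketch: switch the app to the new nvstreammux. The env var must be set
    # before the nvstreammux plugin is loaded (exporting USE_NEW_NVSTREAMMUX=yes
    # in the shell before running the app works as well).
    import os
    os.environ["USE_NEW_NVSTREAMMUX"] = "yes"

    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst

    Gst.init(None)

    streammux = Gst.ElementFactory.make("nvstreammux", "stream-muxer")
    streammux.set_property("batch-size", 4)
    # With the new mux, batching behaviour (e.g. how long to wait for all
    # sources) comes from the config file instead of batched-push-timeout.
    streammux.set_property(
        "config-file-path",
        "/opt/nvidia/deepstream/deepstream-6.4/samples/configs/deepstream-app/config_mux_source4.txt",
    )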

Am I supposed to define idrinterval that way:

    # Make the encoder
    if codec == "H264":
        encoder = Gst.ElementFactory.make("nvv4l2h264enc", "encoder")
        print("Creating H264 Encoder")
    elif codec == "H265":
        encoder = Gst.ElementFactory.make("nvv4l2h265enc", "encoder")
        print("Creating H265 Encoder")
    if not encoder:
        sys.stderr.write(" Unable to create encoder")
    encoder.set_property("bitrate", bitrate)
    encoder.set_property("idrinterval", 30)
    if is_aarch64():
        encoder.set_property("preset-level", 1)
        encoder.set_property("insert-sps-pps", 1)
        #encoder.set_property("bufapi-version", 1)

It does not solve the issue…

Yes. This problem can only be analyzed step by step to find where the delay is introduced. Could you run the following experiments to narrow that down?

  1. Remove the RTSP-out related plugins and display the result directly on the screen, as in deepstream_test_1.py: …nvdsosd->nveglglessink.

  2. Display the RTSP source directly on the screen and observe the delay, e.g. uridecodebin->nveglglessink (a minimal sketch follows this list).

  3. Replace the nvinferserver plugin with nvinfer for comparison.
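
For experiment 2, a minimal standalone sketch (the URI is a placeholder; element names are the standard GStreamer/DeepStream ones for x86) that decodes one RTSP source and renders it with no inference, to measure the baseline display delay:

    # Sketch: rtsp source -> decode -> display, no inference, to observe the
    # baseline delay between the camera and the rendered output.
    import sys
    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst, GLib

    Gst.init(None)
    uri = sys.argv[1] if len(sys.argv) > 1 else "rtsp://camera/stream"  # placeholder URI

    # uridecodebin picks rtspsrc/depay/decoder automatically; nvvideoconvert
    # bridges the decoder output to nveglglessink on x86.
    pipeline = Gst.parse_launch(f"uridecodebin uri={uri} ! nvvideoconvert ! nveglglessink")
    pipeline.set_state(Gst.State.PLAYING)

    loop = GLib.MainLoop()
    try:
        loop.run()
    except KeyboardInterrupt:
        pass
    pipeline.set_state(Gst.State.NULL)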

Thank you for these instructions.
Experiments 1 and 2 also produce the delays.
Experiment 3 does not produce a delay.
Can you provide instructions to help me with multi-stream scaling using nvinferserver?

It’s weird; in theory the 2nd experiment should not cause any delay. It just decodes your RTSP source and displays it directly.
And what do you mean by multi-stream scaling using nvinferserver? The multiple streams first form a batch in nvstreammux and are then scaled according to the dimensions of your model in the config file.

Given the test results, everything points to a problem with nvinferserver.
Why, when I increase the number of input streams, do these delays appear?
Thank you

Could you refer to our FAQ to compare the latency between nvinfer and nvinferserver?
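
For reference, besides the FAQ’s built-in latency measurement, a rough cross-check that needs no special bindings is to time buffers through the inference element with pad probes. The sketch below is illustrative (the pgie variable and the attach_latency_probes helper are not part of the sample) and works the same whether the element is nvinfer or nvinferserver:

    # Sketch: time how long each buffer spends inside the PGIE element by
    # probing its sink and src pads and matching buffers on their PTS.
    import time
    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst

    _enter_times = {}

    def pgie_sink_probe(pad, info, u_data):
        buf = info.get_buffer()
        if buf:
            _enter_times[buf.pts] = time.monotonic()
        return Gst.PadProbeReturn.OK

    def pgie_src_probe(pad, info, u_data):
        buf = info.get_buffer()
        if buf and buf.pts in _enter_times:
            latency_ms = (time.monotonic() - _enter_times.pop(buf.pts)) * 1000.0
            print(f"pgie latency: {latency_ms:.1f} ms")
        return Gst.PadProbeReturn.OK

    def attach_latency_probes(pgie):
        # pgie is the nvinfer or nvinferserver element created elsewhere in the app.
        pgie.get_static_pad("sink").add_probe(Gst.PadProbeType.BUFFER, pgie_sink_probe, 0)
        pgie.get_static_pad("src").add_probe(Gst.PadProbeType.BUFFER, pgie_src_probe, 0)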

Thank you for your reply.

I have performed tests with deepstream-test3 sample.

I have noticed that the following lines:

        if source_element.find_property('drop-on-latency') != None:
            Object.set_property("drop-on-latency", True)

drop frames at the pipeline input when the pipeline is subject to latency. With this property set, the clock of the display output stays synchronized with that of the original camera feed (for both nvinfer and nvinferserver), which seems logical. When these lines are commented out, I again get latency that grows over time with nvinferserver. However, no latency appears with nvinfer (with these lines commented out).

Here’s the modified sample file:
deepstream_test_3.txt (18.8 KB)

Here are the results for deepstream-test3 with latency display:

Using nvinferserver gRPC (I still use the same config file as the one provided in the initial post, dstest1_pgie_inferserver_config.txt):

python3 deepstream_test_3.py -i rtsp://1 rtsp://2 rtsp://3 rtsp://4 rtsp://5 rtsp://6 --pgie nvinferserver -c config_triton_grpc_infer_primary_peoplenet.txt


Using nvinfer:

python3 deepstream_test_3.py -i rtsp://1 rtsp://2 rtsp://3 rtsp://4 rtsp://5 rtsp://6 --pgie nvinfer -c config_infer_primary_peoplenet.txt


What do you think of these latency results?
I have the impression that nothing justifies latencies of several minutes…

Should the queues also be analyzed?
If so, how?

Thanks

The drop-on-latency property is used in conjunction with the latency property.
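
For illustration, a sketch of what that means in the deepstream_test_3.py callback quoted above (the 200 ms value is only an example; drop-on-latency only has an effect relative to the jitterbuffer size set by latency):

    # Sketch: set a small rtspsrc jitterbuffer together with drop-on-latency,
    # so frames arriving later than the configured latency are dropped instead
    # of accumulating delay.
    def decodebin_child_added(child_proxy, Object, name, user_data):
        if "source" in name:
            source_element = child_proxy.get_by_name("source")
            if source_element.find_property("latency") is not None:
                Object.set_property("latency", 200)           # jitterbuffer in ms (example value)
            if source_element.find_property("drop-on-latency") is not None:
                Object.set_property("drop-on-latency", True)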

It is possible that too many queue elements in the pipeline are caching buffers. You can try removing the queue plugins and check.
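
As an alternative to removing them, queues can be bounded and made leaky so they drop stale buffers instead of building up latency; a minimal sketch (property values are examples):

    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst

    Gst.init(None)

    # Cap a queue at a few buffers and let it leak downstream (drop its oldest
    # buffer) when full, instead of caching frames and adding latency.
    queue = Gst.ElementFactory.make("queue", "bounded-queue")
    queue.set_property("max-size-buffers", 4)   # example cap on buffered frames
    queue.set_property("max-size-bytes", 0)     # 0 disables the byte limit
    queue.set_property("max-size-time", 0)      # 0 disables the time limit
    queue.set_property("leaky", 2)              # 2 = leaky downstream: drop oldest when full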

You can also try using CAPI mode instead of gRPC mode and check the latency.

Does your use case have to be in GRPC mode?

Hello,

Thank you for your answers.

Removing the queues doesn’t solve the problem unfortunately.

Using CAPI mode instead of gRPC doesn’t solve the problem either.

I did however make a few observations:

  • I use PeopleNet in ONNX format (using ONNX Runtime). I’ve analyzed the GPU information. Although the memory is only partially used, GPU utilization fluctuates between 70-98% (according to nvidia-smi). Since this value fluctuates, I can’t decide whether it’s the bottleneck… Is there another way of checking this more reliably? (A small sampling sketch follows this list.) When I reduce to 3 simultaneous streams, GPU utilization hovers around 60-95%, and in that case there’s no delay.
  • When I use peoplenet with TensorRT, GPU utilization is drastically lower (hovering around 35% for 10 simultaneous streams). So there’s no delay in this case. How is it possible to have such a big difference? I don’t know if TensorRT offers increased optimization, but it seems very significant in this case…
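
For reference, one way to get a steadier picture than watching nvidia-smi is to sample utilization programmatically; a sketch assuming the pynvml bindings for NVML (package nvidia-ml-py) are installed:

    # Sketch: sample GPU utilization at a fixed interval so sustained
    # saturation can be distinguished from short spikes.
    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # the RTX 3060

    samples = []
    try:
        for _ in range(600):                        # ~60 s at 100 ms per sample
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            samples.append(util.gpu)
            time.sleep(0.1)
    finally:
        pynvml.nvmlShutdown()

    print(f"min {min(samples)}%  avg {sum(samples)/len(samples):.0f}%  max {max(samples)}%")
    print(f"share of samples at >=95%: {100*sum(s >= 95 for s in samples)/len(samples):.0f}%")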

I’ve also done some tests to monitor the triton server queue. Whether I’m in a situation where latency is growing or not, the queue continues to grow infinitely. How is it possible for a queue to grow forever when no latency is observed at the RTSP output?
You can replicate this using deepstream-test3 with a Triton gRPC server (peoplenet tensorRT).
Here is the Python code I use to monitor the size of the Triton server queue:
analyze_triton.txt (1.1 KB)

python3 analyze_triton.py triton_ip_address

Thanks for your help

This means that there is a high probability that the performance of your GPU card is the problem.

Do you mean using our etlt model? Could you attach the link to the model and explain how you got the ONNX model?

If the problem is the performance of the GPU card, you can try configuring the interval parameter.

The model file is attached to the first post, in triton_model_repo.zip (7.8 MB).

The ONNX model comes from the NGC catalogue (PeopleNet | NVIDIA NGC): pruned_quantized_decrypted_v2.3.3

The TensorRT engine file has been generated from the onnx file on my RTX3060.

Thanks for this:
If the problem is the performance of the GPU card, you can try configuring the interval parameter.
Can you explain what this parameter is used for, and where and how it needs to be defined?

But I want to understand the problem, not just work around it.

Can you answer this part, please?
I’ve also done some tests to monitor the triton server queue. Whether I’m in a situation where latency is growing or not, the queue continues to grow infinitely. How is it possible for a queue to grow forever when no latency is observed at the RTSP output?
You can replicate this using deepstream-test3 with a Triton gRPC server (peoplenet tensorRT).
Here is the Python code I use to monitor the size of the Triton server queue:
analyze_triton.txt (1.1 KB)

python3 analyze_triton.py triton_ip_address

Thanks

When your GPU performance is not sufficient, you can skip inference on some frames.
Just set that in your config file (dstest1_pgie_inferserver_config.txt):

input_control {
  process_mode: PROCESS_MODE_FULL_FRAME
  operate_on_gie_id: -1
  interval: 5
}

So you just want to know why the etlt model performs better than ONNX and why the queue grows infinitely?

Thanks for answer 1. Is it better to use interval or drop-frame-interval?

So you just want to know why the etlt model performs better than ONNX and why the queue grows infinitely?

I know that TensorRT is more efficient than ONNX Runtime.
But when using an etlt model, without noticing any latency on the display output, why am I observing the Triton server queue growing infinitely?

Both. The interval is configured for nvinferserver, as in the config I attached. The drop-frame-interval is configured on the source. You can refer to our demo code deepstream_rt_src_add_del.py; a sketch follows.
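
For illustration, a sketch of where drop-frame-interval could be set for a uridecodebin-based source. This assumes the nvv4l2decoder child created inside uridecodebin exposes a drop-frame-interval property (check with gst-inspect-1.0 nvv4l2decoder); the approach mirrors the child-added callbacks used in the samples:

    # Sketch: make the decoder output only every Nth frame by setting
    # drop-frame-interval on the nvv4l2decoder created inside uridecodebin.
    def decodebin_child_added(child_proxy, Object, name, user_data):
        if "decodebin" in name:
            # Recurse into nested decodebins so the decoder element is reached.
            Object.connect("child-added", decodebin_child_added, user_data)
        if "nvv4l2decoder" in name:
            if Object.find_property("drop-frame-interval") is not None:
                Object.set_property("drop-frame-interval", 2)  # example: output every 2nd frame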

The queue count is a monotonically increasing counter used to compute an average over time (total latency / total count). This value is expected to grow and does not reflect the “current” queue size. The Pending Request Count is a closer approximation of that.

Is the latency not improved after you use CAPI mode and the two parameters I provided? Could you attach the modified config file for CAPI mode?

Regarding the script provided here:
analyze_triton.txt
How do I modify these lines:

        stats = triton_client.get_inference_statistics(model_name="people_nvidia_detector")
        queue_count = stats['model_stats'][0]['inference_stats']['queue']['count']

to obtain the Pending Request Count (the current queue size)?

What do you mean by CAPI exactly? Basically using the Triton server without gRPC?

Thanks

Yes. config_triton_infer_primary_peoplenet.txt is the CAPI mode, without gRPC. config_triton_grpc_infer_primary_peoplenet.txt is the gRPC mode.
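
For illustration, the difference sits in the backend.triton block of the nvinferserver config; a sketch based on those two sample configs (the model name and gRPC URL follow this thread’s setup and may need adjusting):

# gRPC mode (config_triton_grpc_infer_primary_peoplenet.txt): nvinferserver
# talks to the separate tritonserver process over the network.
infer_config {
  backend {
    triton {
      model_name: "people_nvidia_detector"
      version: -1
      grpc {
        url: "triton-server:8001"
      }
    }
  }
}

# CAPI mode (config_triton_infer_primary_peoplenet.txt): Triton runs in-process
# inside the DeepStream container and loads the models from a local repository.
infer_config {
  backend {
    triton {
      model_name: "people_nvidia_detector"
      version: -1
      model_repo {
        root: "/models"
        strict_model_config: true
      }
    }
  }
}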

Also, did you set the drop-frame-interval parameter and the interval parameter for nvinferserver?

I will try that tomorrow, thanks.

Can you answer the other part please?
Pending Request Count

You can get the metric with a curl command like:

$ curl <your_ip>:8002/metrics

Then check the nv_inference_pending_request_count field from the response.