Analyzing latency using Nsight Systems in DeepStream

Hardware Platform: Jetson Orin Nano 4GB
DeepStream Version: 6.2

I used DeepStream-Yolo to quantize a YOLOv5s model to INT8, and its theoretical throughput reaches about 120 FPS. But when I run the YOLOv5 object detection model on 4 RTSP streams in DeepStream, the latency increases and the displayed video becomes distorted.

I suspect the delay comes from my model inference. I profiled the DeepStream pipeline with Nsight Systems and got the following report:


Some of my batches take 4 ms, while others are close to 500 ms.


The figure above shows that some layers of my model take a very long time, which never happens with single-RTSP-stream detection, and the slow layers are not always the same ones.

If the model itself were the problem, the same issue should also appear with a single RTSP stream, but in practice that delay does not occur. I have attached the full report file above. Can you help me analyze where the problem lies?

This is my DeepStream test code, together with the ONNX model. The post-processing is generated with DeepStream-Yolo (GitHub - marcoslucianops/DeepStream-Yolo).

How did you get the data? What is the batch size?

Can you post the deepstream-app configuration file and the nvinfer configuration file?

@Fiona.Chen
I tested it with trtexec --loadEngine=xxx.engine, and it reported a throughput of 123.148 qps on the Jetson Orin Nano 4GB. My model file and DeepStream code are as follows:

The Nsight Systems report was captured with batch size = 4.

I mean what is the TensorRT engine’s batch size?

@Fiona.Chen
Dynamic batch: min=1, opt=4, max=8

Then the engine batch size is 8. Can you post the whole log of the trtexec measurement?

And please post the deepstream-app configuration file and the nvinfer configuration file. Thank you!

@Fiona.Chen
Sorry, I didn't notice earlier that the access permissions of the shared links were not set correctly.

This is my pipeline and configuration code

This is the report for batch size 4

Can you post the whole log of the trtexec measurement? Not the Nsight log

@Fiona.Chen
trtexec.txt (16.8 KB)

The trtexec command is wrong. Your model supports dynamic batch, right?
How did you generate the TensorRT engine?
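(For reference: with a dynamic-batch engine, trtexec benchmarks the shape it is given, so the batch size has to be passed explicitly, for example trtexec --loadEngine=model.engine --shapes=images:4x3x640x640 to measure batch 4. The input tensor name "images" and the 640x640 resolution are assumptions based on a typical YOLOv5 export; without an explicit shape the reported qps may not correspond to the batch size used in the pipeline.)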

@Fiona.Chen
Right, min=1, opt=4, max=8

Why did you set the nvinfer batch size to 1?

@Fiona.Chen
I realized this morning that I had originally intended to set the nvinfer batch size to the number of RTSP streams. I ran another experiment this morning with the nvinfer batch size changed accordingly, but the inference time is still too long. I will share the report with you later.
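For reference, a minimal sketch of keeping the nvinfer batch size in step with the stream count (the element handle pgie and the config file name are placeholders, not code from this thread; the batch-size property of gst-nvinfer overrides the value in the config file):

/* pgie is a placeholder for the gst-nvinfer element; the config file name is assumed */
g_object_set(G_OBJECT(pgie), "config-file-path", "config_infer_primary.txt", NULL);
/* match the nvstreammux batch size so one muxed batch is inferred in a single call */
g_object_set(G_OBJECT(pgie), "batch-size", rtsp_number, NULL);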

The postprocessing seems to be the CPU version. Have you tested the performance of the postprocessing?

The GitHub - marcoslucianops/DeepStream-Yolo project is not provided by NVIDIA; we do not guarantee its performance.

What is the original FPS of your RTSP sources? Do they all use the TCP protocol? Have you set enough latency with rtspsrc?

@Fiona.Chen
My RTSP streams have a maximum of 25 FPS, and I use nvurisrcbin for decoding. The specific parameters are as follows:

g_object_set(G_OBJECT(uri_decode_bin), "uri", uri, NULL);
g_object_set(G_OBJECT(uri_decode_bin), "rtsp-reconnect-interval", 60, NULL);   /* reconnect after 60 s without data */
g_object_set(G_OBJECT(uri_decode_bin), "num-extra-surfaces", 8, NULL);         /* extra decoder output surfaces */
g_object_set(G_OBJECT(uri_decode_bin), "udp-buffer-size", 524288 * 2, NULL);   /* UDP receive buffer, bytes */
g_object_set(G_OBJECT(uri_decode_bin), "latency", 150, NULL);                  /* RTSP jitterbuffer latency, ms */

The select-rtp-protocol parameter of nvurisrcbin is left at its default:
(0): rtp-multi - UDP + UDP Multicast + TCP
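(If UDP packet loss turns out to be the cause of the smearing, a minimal sketch of forcing RTP over TCP on the same element, assuming the enum value 4 = tcp as documented for deepstream-app source configs:)

/* force RTP over TCP to rule out UDP packet loss; 4 = tcp */
g_object_set(G_OBJECT(uri_decode_bin), "select-rtp-protocol", 4, NULL);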

nvstreammux parameters are as follows:

g_object_set(G_OBJECT(streammux), "batch-size", rtsp_number, NULL);       /* one frame per stream per batch */
g_object_set(G_OBJECT(streammux), "interpolation-method", 5, NULL);
g_object_set(G_OBJECT(streammux), "compute-hw", 2, NULL);
g_object_set(G_OBJECT(streammux), "width", 640, "height", 640, NULL);     /* muxer output resolution */
g_object_set(G_OBJECT(streammux), "live-source", TRUE, NULL);
g_object_set(G_OBJECT(streammux), "buffer-pool-size", 16, NULL);
g_object_set(G_OBJECT(streammux), "batched-push-timeout", 40000, NULL);   /* 40000 us = 40 ms, one frame period at 25 FPS */

@Fiona.Chen

In the Nsight log:

batch size = 1: queueInput max Duration = 16.180 ms
batch size = 2: queueInput max Duration = 324.4 ms
batch size = 4: queueInput max Duration = 622.355 ms

I found that most of the time is spent on a certain node in TensorRT, as shown in the following figure


But which layers are slow seems random, and I have not found a pattern yet; with batch size 1, there is no such large difference.

Also, I don't know how to debug DeepStream with Nsight Systems; I'm just a beginner. The red box in the figure below represents one batch. This batch took more than 500 ms, and most of the time was spent in pthread_cond_wait. Is this related to the configuration of the pipeline element parameters?

What is your final purpose in checking the latency with Nsight?

The gst-nvinfer plugin is open source. The source code is in /opt/nvidia/deepstream/deepstream/sources/libs/nvdsinfer and /opt/nvidia/deepstream/deepstream/sources/gst-plugins/gst-nvinfer. There is also a source code diagram for your reference: DeepStream SDK FAQ - Intelligent Video Analytics / DeepStream SDK - NVIDIA Developer Forums

From the source code you can see that "queueInputBatch" includes the preprocess (converting the image data to tensor data), the TensorRT inference, and the postprocess (parsing the output tensors and calculating the output bboxes). From your input, you are using a batch-size-1 TensorRT engine whose inference alone runs at around 120 FPS. The preprocess is accelerated by the GPU and VIC. The postprocess is customized by the third party and runs on the CPU. So the total performance of preprocess + inference + postprocess is less than 120 FPS, and we do not know the exact number.
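If it helps to isolate that unknown, one option is to wrap the CPU parsing code in an NVTX range so that it shows up as a named block in the Nsight Systems timeline, separate from the TensorRT part of queueInputBatch. A minimal sketch, assuming the header-only NVTX headers shipped with the CUDA toolkit; the function below is only a placeholder, not the actual DeepStream-Yolo parser:

#include "nvtx3/nvToolsExt.h"   /* header-only NVTX from the CUDA toolkit */

static void parse_yolo_output_placeholder(void)
{
    nvtxRangePushA("yolo-cpu-postprocess");   /* opens a named range on the calling thread */
    /* ... CPU bbox parsing work would go here ... */
    nvtxRangePop();                           /* closes the range; it then appears in the NVTX row of the timeline */
}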

@Fiona.Chen
The purpose of my inspection is that my DeepStream application has latency and distortion, and with GST_DEBUG=3 it keeps printing "<nvv4l2decoder0> Decoder is producing too many buffers".

The model above can theoretically infer more than 100 frames per second, but in my application I connected two RTSP streams and set live-source in nvstreammux to TRUE. The interval of the nvinfer plugin was set to 25 (my RTSP streams run at 25 FPS), which means the model detects roughly one frame out of every 25. With two RTSP streams that is about two detected frames per second, yet there is still a significant delay and distortion in the detected video (part of the picture is blurred).

With nvurisrcbin added, the pipeline keeps warning "<nvv4l2decoder0> Decoder is producing too many buffers", so I suspect that my model's detection speed is very slow.
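For reference, a minimal sketch of how the interval property is usually set (pgie is a placeholder for the nvinfer element; interval is the number of consecutive batches skipped between inference calls, so 0 means every frame is inferred):

/* pgie is a placeholder handle; interval = 0 infers every batch,
   interval = 24 on a 25 FPS source infers roughly one frame per second per stream */
g_object_set(G_OBJECT(pgie), "interval", 0, NULL);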