Analyzing latency using Nsight Systems in DeepStream

Hardware Platform: Jetson Orin Nano 4GB
DeepStream Version: 6.2

I used DeepStream-Yolo to quantize a YOLOv5s model to INT8, and its theoretical throughput reaches about 120 FPS. But when I run the YOLOv5 object detection model on 4 RTSP streams in DeepStream, the latency increases and the displayed video becomes distorted.

I suspect the delay comes from my model inference. I profiled the DeepStream pipeline with Nsight Systems and got the following report:


Some of my batches take 4 ms, while others are close to 500 ms.


The figure above shows that some layers of my model take a very long time, which never happens with single-RTSP-stream detection, and the slow layers are not always the same ones.

If the model itself were the problem, the same issue should also appear with a single RTSP stream, but in practice that delay does not occur. I have attached the full report file above. Can you help me analyze where the problem lies?

This is my DeepStream test code, together with the ONNX model. The post-processing is generated with DeepStream-Yolo (GitHub - marcoslucianops/DeepStream-Yolo).

How did you get the data? What is the batch size?

Can you post the deepstream-app configuration file and the nvinfer configuration file?

@Fiona.Chen
I tested it with trtexec --loadEngine=xxx.engine, and it reported a throughput of 123.148 qps on the Jetson Orin Nano 4GB. My model file and DeepStream code are as follows:

The Nsight Systems report was captured with batch size = 4.

I mean what is the TensorRT engine’s batch size?

@Fiona.Chen
Dynamic batch: min=1, opt=4, max=8

Then the engine batch size is 8. Can you post the whole log of the trtexec measurement?

And please post the deepstream-app configuration file and the nvinfer configuration file. Thank you!

@Fiona.Chen
Sorry, I didn't notice earlier that the access permissions of the shared links were not set correctly.

This is my pipeline and configuration code

This is the report for batch size 4

Can you post the whole log of the trtexec measurement? Not the Nsight log

@Fiona.Chen
trtexec.txt (16.8 KB)

The trtexec command is wrong. Your model supports dynamic batch, right?
How did you generate the TensorRT engine?
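(For reference: with a dynamic-batch engine, trtexec benchmarks the shape it is given, so the batch size has to be passed explicitly, for example trtexec --loadEngine=model.engine --shapes=images:4x3x640x640 to measure batch 4. The input tensor name "images" and the 640x640 resolution are assumptions based on a typical YOLOv5 export; without an explicit shape the reported qps may not correspond to the batch size used in the pipeline.)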

@Fiona.Chen
Right, min=1, opt=4, max=8

Why did you set the nvinfer batch size to 1?

@Fiona.Chen
I realized this morning that I had originally intended to set the nvinfer batch size to the number of RTSP streams. I ran another experiment this morning with the nvinfer batch size changed accordingly, but the inference time is still too long. I will share the report with you later.
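For reference, a minimal sketch of keeping the nvinfer batch size in step with the stream count (the element handle pgie and the config file name are placeholders, not code from this thread; the batch-size property of gst-nvinfer overrides the value in the config file):

/* pgie is a placeholder for the gst-nvinfer element; the config file name is assumed */
g_object_set(G_OBJECT(pgie), "config-file-path", "config_infer_primary.txt", NULL);
/* match the nvstreammux batch size so one muxed batch is inferred in a single call */
g_object_set(G_OBJECT(pgie), "batch-size", rtsp_number, NULL);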

The postprocessing seems to be the CPU version. Have you tested the performance of the postprocessing?

The GitHub - marcoslucianops/DeepStream-Yolo project is not provided by NVIDIA; we do not guarantee its performance.

What is the original FPS of your RTSP sources? Do they all use the TCP protocol? Have you set enough latency with rtspsrc?

@Fiona.Chen
My RTSP streams have a maximum of 25 FPS, and I use nvurisrcbin for decoding. The specific parameters are as follows:

g_object_set(G_OBJECT(uri_decode_bin), "uri", uri, NULL);
g_object_set(G_OBJECT(uri_decode_bin), "rtsp-reconnect-interval", 60, NULL);   /* reconnect after 60 s without data */
g_object_set(G_OBJECT(uri_decode_bin), "num-extra-surfaces", 8, NULL);         /* extra decoder output surfaces */
g_object_set(G_OBJECT(uri_decode_bin), "udp-buffer-size", 524288 * 2, NULL);   /* UDP receive buffer, bytes */
g_object_set(G_OBJECT(uri_decode_bin), "latency", 150, NULL);                  /* RTSP jitterbuffer latency, ms */

The select-rtp-protocol parameter of nvurisrcbin is left at its default:
(0): rtp-multi - UDP + UDP Multicast + TCP
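(If UDP packet loss turns out to be the cause of the smearing, a minimal sketch of forcing RTP over TCP on the same element, assuming the enum value 4 = tcp as documented for deepstream-app source configs:)

/* force RTP over TCP to rule out UDP packet loss; 4 = tcp */
g_object_set(G_OBJECT(uri_decode_bin), "select-rtp-protocol", 4, NULL);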

nvstreammux parameters are as follows:

g_object_set(G_OBJECT(streammux), "batch-size", rtsp_number, NULL);       /* one frame per stream per batch */
g_object_set(G_OBJECT(streammux), "interpolation-method", 5, NULL);
g_object_set(G_OBJECT(streammux), "compute-hw", 2, NULL);
g_object_set(G_OBJECT(streammux), "width", 640, "height", 640, NULL);     /* muxer output resolution */
g_object_set(G_OBJECT(streammux), "live-source", TRUE, NULL);
g_object_set(G_OBJECT(streammux), "buffer-pool-size", 16, NULL);
g_object_set(G_OBJECT(streammux), "batched-push-timeout", 40000, NULL);   /* 40000 us = 40 ms, one frame period at 25 FPS */

@Fiona.Chen

In the Nsight log:

batch size = 1: queueInput max Duration = 16.180 ms
batch size = 2: queueInput max Duration = 324.4 ms
batch size = 4: queueInput max Duration = 622.355 ms

I found that most of the time is spent on a certain node in TensorRT, as shown in the following figure


But which layers are slow seems random, and I have not found a pattern yet; with batch size 1, there is no such large difference.

Also, I don't know how to debug DeepStream with Nsight Systems; I'm just a beginner. The red box in the figure below represents one batch. This batch took more than 500 ms, and most of the time was spent in pthread_cond_wait. Is this related to the configuration of the pipeline element parameters?

What is your final purpose in checking the latency with Nsight?

The gst-nvinfer plugin is open source. The source code is in /opt/nvidia/deepstream/deepstream/sources/libs/nvdsinfer and /opt/nvidia/deepstream/deepstream/sources/gst-plugins/gst-nvinfer. There is also a source code diagram for your reference: DeepStream SDK FAQ - Intelligent Video Analytics / DeepStream SDK - NVIDIA Developer Forums

From the source code you can see that "queueInputBatch" includes the preprocess (converting the image data to tensor data), the TensorRT inference, and the postprocess (parsing the output tensors and calculating the output bboxes). From your input, you are using a batch-size-1 TensorRT engine whose inference alone runs at around 120 FPS. The preprocess is accelerated by the GPU and VIC. The postprocess is customized by the third party and runs on the CPU. So the total performance of preprocess + inference + postprocess is less than 120 FPS, and we do not know the exact number.
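If it helps to isolate that unknown, one option is to wrap the CPU parsing code in an NVTX range so that it shows up as a named block in the Nsight Systems timeline, separate from the TensorRT part of queueInputBatch. A minimal sketch, assuming the header-only NVTX headers shipped with the CUDA toolkit; the function below is only a placeholder, not the actual DeepStream-Yolo parser:

#include "nvtx3/nvToolsExt.h"   /* header-only NVTX from the CUDA toolkit */

static void parse_yolo_output_placeholder(void)
{
    nvtxRangePushA("yolo-cpu-postprocess");   /* opens a named range on the calling thread */
    /* ... CPU bbox parsing work would go here ... */
    nvtxRangePop();                           /* closes the range; it then appears in the NVTX row of the timeline */
}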

@Fiona.Chen
The purpose of my inspection is that my DeepStream application has latency and distortion, and with GST_DEBUG=3 it keeps printing "<nvv4l2decoder0> Decoder is producing too many buffers".

The model above can theoretically infer more than 100 frames per second, but in my application I connected two RTSP streams and set live-source in nvstreammux to TRUE. The interval of the nvinfer plugin was set to 25 (my RTSP streams run at 25 FPS), which means the model detects roughly one frame out of every 25. With two RTSP streams that is about two detected frames per second, yet there is still a significant delay and distortion in the detected video (part of the picture is blurred).

With nvurisrcbin added, the pipeline keeps warning "<nvv4l2decoder0> Decoder is producing too many buffers", so I suspect that my model's detection speed is very slow.
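For reference, a minimal sketch of how the interval property is usually set (pgie is a placeholder for the nvinfer element; interval is the number of consecutive batches skipped between inference calls, so 0 means every frame is inferred):

/* pgie is a placeholder handle; interval = 0 infers every batch,
   interval = 24 on a 25 FPS source infers roughly one frame per second per stream */
g_object_set(G_OBJECT(pgie), "interval", 0, NULL);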