Optimizing DeepStream Pipeline: Reducing RetinaNet Latency on Jetson AGX Orin

• Hardware Platform (Jetson / GPU) : NVIDIA Jetson AGX Orin
• DeepStream Version : 7.1
• JetPack Version (valid for Jetson only) : 6.1
• TensorRT Version : 8.6.2.3
• Issue Type (questions, new requirements, bugs) : question
Hello,

I am running a DeepStream pipeline on Jetson AGX Orin, where I have two inference branches:

  1. The upper branch runs efficiently, and buffers reach the sink of the entire pipeline in less than 16 ms.
  2. The lower branch contains a RetinaNet object detection model with a ResNet50 backbone, which takes approximately 100–120 ms per inference.

When building this pipeline, I followed the structure described in Nvidia Parallel Inference to run the two inference branches in parallel.
Because of this long inference time, the object detection model introduces substantial latency, especially when its outputs are muxed using NvDsMetaMux. Additionally, monitoring with jtop shows that GPU usage is at 100%.

Running the RetinaNet model in isolation (in a separate pipeline containing only this model) still results in near-100% GPU utilization, even with boosted clocks (jetson_clocks enabled).

Increasing the interval of RetinaNet to 10 reduces how often it runs inference, but every time it does run, the inference time of the other models increases to ~30 ms and GPU usage jumps from 20% to 100%. Here is the result from Nsight Systems (nsys) profiling:

The attached screenshot shows how a model that normally completes its inference in 6 ms slows down to as much as 60 ms when the RetinaNet model performs inference at the same time.

Here is the nsys-rep file with the analysis:
retina_latency.nsys-rep.zip (3.1 MB)

Questions & Potential Solutions

  1. Can the upper inference branch be processed independently, without delays introduced by the lower RetinaNet branch?
  2. Are there any optimizations to reduce the high latency of RetinaNet?
  • I plan to test INT8 precision, but I doubt it will bring inference time below 16 ms.
  • Would changing the model architecture (e.g., a lighter object detector) be the only viable solution?
  • Are there any DeepStream-specific optimizations (e.g., pipeline modifications, queue settings, GPU scheduling) that could help? (See the sketch at the end of this post for the kind of queue changes I have in mind.)

Any guidance on optimizing this pipeline or reducing RetinaNet’s inference latency would be greatly appreciated!
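
To make question 3 concrete, here is a minimal sketch (GStreamer Python bindings) of the kind of queue-based decoupling I mean. The URI, resolutions, config file paths, and queue sizes are placeholders, and this is a simplified tee-based layout rather than the full parallel-inference-app topology:

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# The slow RetinaNet branch sits behind a leaky queue, so under load it drops
# frames instead of back-pressuring the tee and stalling the fast classifier
# branch, which keeps a small non-leaky queue.
pipeline = Gst.parse_launch(
    "uridecodebin uri=file:///path/to/input.mp4 ! nvvideoconvert ! "
    "video/x-raw(memory:NVMM) ! mux.sink_0 "
    "nvstreammux name=mux batch-size=1 width=1920 height=1080 ! tee name=t "
    # Fast branch: the classifier.
    "t. ! queue max-size-buffers=4 ! "
    "nvinfer config-file-path=classifier_config.txt ! fakesink sync=false "
    # Slow branch: RetinaNet, with a leaky queue and a large inference interval,
    # so it skips/drops frames rather than blocking the rest of the pipeline.
    "t. ! queue leaky=downstream max-size-buffers=2 ! "
    "nvinfer config-file-path=retinanet_config.txt interval=10 ! fakesink sync=false"
)

pipeline.set_state(Gst.State.PLAYING)
loop = GLib.MainLoop()
try:
    loop.run()
finally:
    pipeline.set_state(Gst.State.NULL)

queue leaky=downstream drops buffers when the queue is full, trading completeness of detections for keeping the fast branch within its 16 ms budget; it only decouples the branches at the pipeline level and does not reduce GPU contention while RetinaNet is actually executing.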

How did you get such data? Have you measured the model’s performance with the “trtexec” tool?

It seems the model is too heavy for the AGX Orin GPU.

What is the model in your first branch? If the whole model complies with Working with DLA — NVIDIA TensorRT Documentation, you may consider running it on the DLA instead of the GPU. (BTW: the DLA may be slower than the GPU.)
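
For reference, a minimal sketch of what building a DLA engine could look like with the TensorRT Python API; the ONNX file name and engine path are placeholders, and this is not taken from the DLA documentation itself:

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Placeholder ONNX export of the detector.
with open("retinanet_resnet50.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)            # DLA requires FP16 or INT8
config.default_device_type = trt.DeviceType.DLA  # prefer DLA for supported layers
config.DLA_core = 0
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)    # unsupported layers run on the GPU

serialized = builder.build_serialized_network(network, config)
with open("retinanet_dla.engine", "wb") as f:
    f.write(serialized)

An equivalent engine can be built with trtexec using --useDLACore=0 --allowGPUFallback --fp16, and on the DeepStream side the matching gst-nvinfer config keys are enable-dla=1 and use-dla-core=0.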

This is the DeepStream forum; you may ask model pruning, quantization, … related questions in the TAO forum: Latest Intelligent Video Analytics/TAO Toolkit topics - NVIDIA Developer Forums

@Fiona.Chen Thank you for your reply.

So I did the benchmarking with the trtexec tool, following the DeepStream Best Practices guide, and these are the results I got:
Batch size 1, Precision FP32

=== Performance summary ===
Throughput: 9.47074 qps
Latency: min = 105.46 ms, max = 105.713 ms, mean = 105.587 ms, median = 105.592 ms

Batch size 8, Precision FP32

=== Performance summary ===
Throughput: 1.31233 qps
Latency: min = 761.442 ms, max = 764.165 ms, mean = 762.003 ms, median = 761.746 ms

Batch size 1, Precision FP16

=== Performance summary ===
Throughput: 19.9816 qps
Latency: min = 49.9731 ms, max = 50.7459 ms, mean = 50.045 ms, median = 50.0352 ms

Batch size 8, Precision FP16

=== Performance summary ===
Throughput: 2.64337 qps
Latency: min = 376.196 ms, max = 382.049 ms, mean = 378.304 ms, median = 376.544 ms

However, I have not tried INT8 yet; even so, this model will not get below 16 ms.
I also tested it in the DeepStream pipeline, and FP16 with batch size 1 has a latency of around 120 ms including postprocessing.

The model in the first branch is a classifier. However, my object detection model on its own, without any other models, shows similar performance of 100–120 ms latency per inference.

If the whole model complies with Working with DLA — NVIDIA TensorRT Documentation, you may consider running it on the DLA instead of the GPU. (BTW: the DLA may be slower than the GPU.)

The performance is too bad to run on AGX Orin. For model optimization, you may consult the TAO Toolkit forum: TAO Toolkit - NVIDIA Developer Forums

