Optimizing DeepStream Pipeline: Reducing RetinaNet Latency on Jetson AGX Orin

• Hardware Platform (Jetson / GPU) : NVIDIA Jetson AGX Orin
• DeepStream Version : 7.1
• JetPack Version (valid for Jetson only) : 6.1
• TensorRT Version : 8.6.2.3
• Issue Type (questions, new requirements, bugs) : question
Hello,

I am running a DeepStream pipeline on Jetson AGX Orin, where I have two inference branches:

  1. The upper branch runs efficiently, and buffers reach the sink of the entire pipeline in less than 16 ms.
  2. The lower branch contains a RetinaNet object detection model with a ResNet50 backbone, which takes approximately 100–120 ms per inference.

When building this pipeline, I followed the structure described in Nvidia Parallel Inference to run the two inference branches in parallel.
Because of this long inference time, the object detection model introduces substantial latency, especially when its outputs are muxed using NvDsMetaMux. Additionally, monitoring with jtop shows that GPU usage is at 100%.

Running the RetinaNet model in isolation (in a separate pipeline containing only this model) still results in near-100% GPU utilization, even with boosted clocks (jetson_clocks enabled).

Increasing the interval of RetinaNet to 10 reduces how often it runs inference, but every time it does run, the inference time of the other models increases to ~30 ms and GPU usage jumps from 20% to 100%. Here is the result from Nsight Systems (nsys) profiling:

The attached screenshot shows how a model that normally completes its inference in 6 ms slows down to as much as 60 ms when the RetinaNet model performs inference at the same time.

Here is the nsys-rep file with the analysis:
retina_latency.nsys-rep.zip (3.1 MB)

Questions & Potential Solutions

  1. Can the upper inference branch be processed independently, without delays introduced by the lower RetinaNet branch?
  2. Are there any optimizations to reduce the high latency of RetinaNet?
  • I plan to test INT8 precision, but I doubt it will bring inference time below 16 ms.
  • Would changing the model architecture (e.g., a lighter object detector) be the only viable solution?
  • Are there any DeepStream-specific optimizations (e.g., pipeline modifications, queue settings, GPU scheduling) that could help? (See the sketch at the end of this post for the kind of queue changes I have in mind.)

Any guidance on optimizing this pipeline or reducing RetinaNet’s inference latency would be greatly appreciated!
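
To make question 3 concrete, here is a minimal sketch (GStreamer Python bindings) of the kind of queue-based decoupling I mean. The URI, resolutions, config file paths, and queue sizes are placeholders, and this is a simplified tee-based layout rather than the full parallel-inference-app topology:

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# The slow RetinaNet branch sits behind a leaky queue, so under load it drops
# frames instead of back-pressuring the tee and stalling the fast classifier
# branch, which keeps a small non-leaky queue.
pipeline = Gst.parse_launch(
    "uridecodebin uri=file:///path/to/input.mp4 ! nvvideoconvert ! "
    "video/x-raw(memory:NVMM) ! mux.sink_0 "
    "nvstreammux name=mux batch-size=1 width=1920 height=1080 ! tee name=t "
    # Fast branch: the classifier.
    "t. ! queue max-size-buffers=4 ! "
    "nvinfer config-file-path=classifier_config.txt ! fakesink sync=false "
    # Slow branch: RetinaNet, with a leaky queue and a large inference interval,
    # so it skips/drops frames rather than blocking the rest of the pipeline.
    "t. ! queue leaky=downstream max-size-buffers=2 ! "
    "nvinfer config-file-path=retinanet_config.txt interval=10 ! fakesink sync=false"
)

pipeline.set_state(Gst.State.PLAYING)
loop = GLib.MainLoop()
try:
    loop.run()
finally:
    pipeline.set_state(Gst.State.NULL)

queue leaky=downstream drops buffers when the queue is full, trading completeness of detections for keeping the fast branch within its 16 ms budget; it only decouples the branches at the pipeline level and does not reduce GPU contention while RetinaNet is actually executing.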

How did you get such data? Have you measured the model’s performance with the “trtexec” tool?

It seems the model is too heavy for the AGX Orin GPU.

What is the model in your first branch? If the whole model complies with Working with DLA — NVIDIA TensorRT Documentation, you may consider running it on the DLA instead of the GPU. (BTW: the DLA may be slower than the GPU.)
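
For reference, a minimal sketch of what building a DLA engine could look like with the TensorRT Python API; the ONNX file name and engine path are placeholders, and this is not taken from the DLA documentation itself:

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Placeholder ONNX export of the detector.
with open("retinanet_resnet50.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)            # DLA requires FP16 or INT8
config.default_device_type = trt.DeviceType.DLA  # prefer DLA for supported layers
config.DLA_core = 0
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)    # unsupported layers run on the GPU

serialized = builder.build_serialized_network(network, config)
with open("retinanet_dla.engine", "wb") as f:
    f.write(serialized)

An equivalent engine can be built with trtexec using --useDLACore=0 --allowGPUFallback --fp16, and on the DeepStream side the matching gst-nvinfer config keys are enable-dla=1 and use-dla-core=0.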

This is the DeepStream forum; you may ask model pruning, quantization, … related questions in the TAO forum: Latest Intelligent Video Analytics/TAO Toolkit topics - NVIDIA Developer Forums

@Fiona.Chen Thank you for your reply.

So I did the benchmarking with the trtexec tool, following the DeepStream Best Practices guide, and these are the results I got:
Batch size 1, Precision FP32

=== Performance summary ===
Throughput: 9.47074 qps
Latency: min = 105.46 ms, max = 105.713 ms, mean = 105.587 ms, median = 105.592 ms

Batch size 8, Precision FP32

=== Performance summary ===
Throughput: 1.31233 qps
Latency: min = 761.442 ms, max = 764.165 ms, mean = 762.003 ms, median = 761.746 ms

Batch size 1, Precision FP16

=== Performance summary ===
Throughput: 19.9816 qps
Latency: min = 49.9731 ms, max = 50.7459 ms, mean = 50.045 ms, median = 50.0352 ms

Batch size 8, Precision FP16

=== Performance summary ===
Throughput: 2.64337 qps
Latency: min = 376.196 ms, max = 382.049 ms, mean = 378.304 ms, median = 376.544 ms

However, I have not tried INT8 yet; even so, this model will not get below 16 ms.
I also tested it in the DeepStream pipeline, and FP16 with batch size 1 has a latency of around 120 ms including postprocessing.

The model in the first branch is a classifier. However, my object detection model on its own, without any other models, shows similar performance of 100–120 ms latency per inference.

If the whole model complies with Working with DLA — NVIDIA TensorRT Documentation, you may consider running it on the DLA instead of the GPU. (BTW: the DLA may be slower than the GPU.)

The performance is too bad to run on AGX Orin. For model optimization, you may consult the TAO Toolkit forum: TAO Toolkit - NVIDIA Developer Forums

