• Hardware Platform (Jetson / GPU) : NVIDIA Jetson AGX Orin
• DeepStream Version : 7.1
• JetPack Version (valid for Jetson only) : 6.1
• TensorRT Version : 8.6.2.3
• Issue Type (questions, new requirements, bugs) : question
Hello,
I am running a DeepStream pipeline on Jetson AGX Orin, where I have two inference branches:
- The upper branch runs efficiently; on its own, frames travel through the whole pipeline to the sink in under 16 ms.
- The lower branch contains a RetinaNet object detection model with a ResNet50 backbone, which takes approximately 100-120 ms per inference.
When building this pipeline, I followed the structure from the NVIDIA Parallel Inference reference to run the two inference branches in parallel.
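For clarity, here is a simplified sketch of the topology in Python/GStreamer. It is not my exact pipeline and not the full parallel-inference-sample layout (which demuxes and re-muxes streams per branch); the element names, config file paths, and resolution are placeholders:

```python
#!/usr/bin/env python3
# Simplified sketch of the two-branch layout. Element names, config paths and
# resolution are placeholders, not my real configuration, and the nvdsmetamux
# "config-file" property is used the way I understand it from the sample app.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

pipeline = Gst.parse_launch(
    # batched stream is tee'd into two branches
    "nvstreammux name=mux batch-size=1 width=1920 height=1080 ! tee name=t "
    # branch metadata is recombined before the sink
    "nvdsmetamux name=meta config-file=config_metamux.txt ! "
    "nvvideoconvert ! nvdsosd ! fakesink name=sink sync=false "
    # fast (upper) branch
    "t. ! queue name=fast_branch_queue ! "
    "nvinfer config-file-path=fast_model_config.txt ! meta.sink_0 "
    # RetinaNet (lower) branch
    "t. ! queue name=retinanet_queue ! "
    "nvinfer name=retinanet_gie config-file-path=retinanet_config.txt ! meta.sink_1 "
    # source feeding the stream muxer
    "uridecodebin uri=file:///path/to/video.mp4 ! mux.sink_0"
)

pipeline.set_state(Gst.State.PLAYING)
bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                       Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)
```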
Due to this significant inference time, the object detection model introduces substantial latency, especially when its outputs are muxed using NvDsMetaMux. Additionally, monitoring with jtop shows that GPU usage is at 100%.
Running the RetinaNet model in isolation (in a separate pipeline containing only this model) still results in near-100% GPU utilization, even with boosted clocks (jetson_clocks on).
Increasing RetinaNet's interval to 10 reduces how often it runs, but every time it does run, the inference time of the other models increases to ~30 ms and GPU utilization jumps from ~20% to 100%. Here is the result from Nsight Systems (nsys) profiling:
This image shows how a model that normally completes inference in 6 ms slows to as much as 60 ms when the RetinaNet model runs inference at the same time.
Here is the nsys-rep file with the analysis:
retina_latency.nsys-rep.zip (3.1 MB)
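For reference, the interval change mentioned above is just the standard nvinfer interval setting; setting it programmatically on the placeholder element from the sketch earlier would look like this:

```python
# Equivalent to interval=10 in the [property] section of the nvinfer config:
# run inference on one batch, then skip the next 10 batches.
retinanet_gie = pipeline.get_by_name("retinanet_gie")  # placeholder name from the sketch above
retinanet_gie.set_property("interval", 10)
```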
Questions & Potential Solutions
- Can the upper inference branch be processed independently, without delays introduced by the lower RetinaNet branch?
- Are there any optimizations to reduce the high latency of RetinaNet?
- I plan to test INT8 precision, but I doubt it will bring inference time below 16 ms.
- Would changing the model architecture (e.g., a lighter object detector) be the only viable solution?
- Are there any DeepStream-specific optimizations (e.g., pipeline modifications, queue settings, GPU scheduling) that could help? (See the sketch after this list for the kind of queue tuning I mean.)
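To make that last point concrete, this is the kind of queue/sink tuning I have in mind, reusing the placeholder element names from the pipeline sketch above. The values are illustrative guesses, not something I have validated:

```python
# Keep the fast branch shallow so it is not sitting behind buffers queued
# for the slow branch (values are guesses, not tuned):
fast_queue = pipeline.get_by_name("fast_branch_queue")
fast_queue.set_property("max-size-buffers", 4)
fast_queue.set_property("max-size-bytes", 0)
fast_queue.set_property("max-size-time", 0)

# Let the RetinaNet branch drop stale buffers instead of back-pressuring the tee:
retina_queue = pipeline.get_by_name("retinanet_queue")
retina_queue.set_property("leaky", 2)            # 2 = "downstream": drop the oldest buffers
retina_queue.set_property("max-size-buffers", 2)

# Don't let the sink block on clock synchronization:
sink = pipeline.get_by_name("sink")
sink.set_property("sync", False)
```

The idea is to keep the fast branch's queue shallow and let the RetinaNet branch drop stale frames rather than block the tee, but I am not sure this plays well with NvDsMetaMux.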
Any guidance on optimizing this pipeline or reducing RetinaNet’s inference latency would be greatly appreciated!