High Latency in Gst-nvinfer When Using DLA vs. GPU

Hello,

I am analyzing my DeepStream pipeline using NVIDIA Nsight Systems and have noticed something unexpected.

For my model:

On GPU (FP32): TensorRT inference takes 1.9 - 2.1 ms.
On DLA (FP32): TensorRT inference is much faster, taking only 600 - 750 microseconds (0.6 - 0.75 ms).

Since my model is fully supported by DLA, I expected a significant reduction in total processing time when using it. However, the opposite happens:

Total Gst-nvinfer latency on GPU: 3.8 - 4 ms
Total Gst-nvinfer latency on DLA: 7.3 - 7.5 ms

This means that despite much faster inference on DLA, the overall Gst-nvinfer processing time is nearly twice as long as on the GPU.

Possible Causes
From my Nsight Systems profiling, it appears that the additional latency on DLA might be due to:

  1. cudaEventSynchronize() overhead – This function may be blocking execution until the DLA finishes processing (see the sketch after this list).

  2. dequeueOutputAndAttachMeta() overhead – Retrieving inference results from the DLA and attaching metadata to buffers might be taking extra time.
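
To illustrate what I mean in item 1, here is a minimal standalone sketch (not code from my pipeline; the kernel and stream are placeholders) of how cudaEventSynchronize() blocks the calling thread until everything recorded in the event has completed, which is roughly the kind of host-side wait I see around the DLA inference:

#include <cuda_runtime.h>
#include <cstdio>

// Placeholder for the real work (in my case, the TensorRT enqueue on the DLA engine).
__global__ void dummy_work() {}

int main()
{
    cudaStream_t stream;
    cudaEvent_t start, done;
    cudaStreamCreate(&stream);
    cudaEventCreate(&start);
    cudaEventCreate(&done);

    cudaEventRecord(start, stream);
    dummy_work<<<1, 1, 0, stream>>>();   // stand-in for the enqueued inference
    cudaEventRecord(done, stream);

    // Blocks the host thread until everything recorded before 'done' on 'stream'
    // has finished; the wait itself shows up as host-side latency, not device time.
    cudaEventSynchronize(done);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, done);
    printf("device time between events: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    return 0;
}

If that is what Gst-nvinfer is doing, the wait is not wasted DLA time, but it would still show up in the per-element latency I am measuring.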

Questions for Optimization

  1. Why does Gst-nvinfer take so much longer on DLA despite its faster inference time?

  2. Is there a way to reduce synchronization overhead (cudaEventSynchronize())?

  3. Does memory transfer from DLA to GPU introduce extra latency? If so, can it be optimized using pinned memory or asynchronous copies? (See the sketch after this list.)

  4. Could DeepStream’s queueing mechanism be affecting DLA latency? Would increasing batch size help mitigate this?

  5. Are there best practices for optimizing dequeueOutputAndAttachMeta() when using DLA?
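
To make question 3 concrete, here is a minimal sketch of the pattern I am asking about: a pinned (page-locked) host buffer combined with cudaMemcpyAsync, so the device-to-host copy can overlap other host work instead of blocking. The buffer size and names are placeholders, not anything from my actual pipeline:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 1 << 20;   // placeholder size for an output tensor
    float* h_pinned = nullptr;
    float* d_output = nullptr;

    // Pinned host memory is what makes the async copy truly asynchronous.
    cudaMallocHost(reinterpret_cast<void**>(&h_pinned), bytes);
    cudaMalloc(reinterpret_cast<void**>(&d_output), bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemsetAsync(d_output, 0, bytes, stream);   // stand-in for inference output

    // Returns immediately; the host can keep working (e.g. attaching metadata for a
    // previous frame) while the copy is in flight on 'stream'.
    cudaMemcpyAsync(h_pinned, d_output, bytes, cudaMemcpyDeviceToHost, stream);

    // Synchronize only when the results are actually needed.
    cudaStreamSynchronize(stream);
    printf("first value: %f\n", h_pinned[0]);

    cudaStreamDestroy(stream);
    cudaFree(d_output);
    cudaFreeHost(h_pinned);
    return 0;
}

I understand the synchronization point would only move rather than disappear, but with pageable memory an "async" copy degrades to a staged, effectively synchronous transfer, so I would like to know whether this matters inside Gst-nvinfer.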

Here are my profiling results:

Total Gst-nvinfer time on GPU:

Total Gst-nvinfer time on DLA:

Here are TensorRT execution times on DLA:

Can you share your model and the nvinfer configurations for GPU and DLA?

@Fiona.Chen sure, so here is my ONNX model:
model_nchw.onnx.zip (1.0 MB)
and here are my configuration files:

However, there is not much of a difference between them; the only difference is:

enable-dla=1
use-dla-core=0

I have examined the inference speed with Nsight Systems again, and here are the results after setting the max power mode and running jetson_clocks:

sudo nvpmodel -m 0
sudo jetson_clocks

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU)
• DeepStream Version
• JetPack Version (valid for Jetson only)
• TensorRT Version
• NVIDIA GPU Driver Version (valid for GPU only)

Can you provide the label file too?

@Fiona.Chen oh, sorry. So here is the label file for both of them:
labels.txt (5 Bytes)

Why don’t you tell us the following information?
• Hardware Platform (Jetson / GPU)
• DeepStream Version
• JetPack Version (valid for Jetson only)
• TensorRT Version
• NVIDIA GPU Driver Version (valid for GPU only)

@Fiona.Chen I’m so sorry, I had not noticed that I did not include it. Here is the information:
• Hardware Platform (Jetson / GPU) : NVIDIA Jetson AGX Orin
• DeepStream Version : 7.1
• JetPack Version (valid for Jetson only) : 6.1
• TensorRT Version : 8.6.2.3
• Issue Type( questions, new requirements, bugs) : question

I’ve tested with TensorRT tool trtexec on AGX Orin.

/usr/src/tensorrt/bin/trtexec --loadEngine=model_nchw.onnx_b1_dla0_fp32.engine --useDLACore=0

Average on 10 runs - GPU latency: 8.85653 ms

/usr/src/tensorrt/bin/trtexec --loadEngine=model_nchw.onnx_b1_gpu0_fp32.engine

Average on 10 runs - GPU latency: 3.30365 ms

The TensorRT inferencing time is aligned with the nsys logs you provided.

@Fiona.Chen Thank you very much for testing this. However, I do not understand why the DLA is slower than the GPU. Isn’t it supposed to be faster, or is that not always the case?

Here are the questions I asked at the beginning:
Questions for Optimization

  1. Why does Gst-nvinfer take so much longer on DLA despite its faster inference time?
  2. Is there a way to reduce synchronization overhead (cudaEventSynchronize())?
  3. Does memory transfer from DLA to GPU introduce extra latency? If so, can it be optimized using pinned memory or asynchronous copies?
  4. Could DeepStream’s queueing mechanism be affecting DLA latency? Would increasing batch size help mitigate this?
  5. Are there best practices for optimizing dequeueOutputAndAttachMeta() when using DLA?
  6. Could it be because the architecture is not optimized for DLA? What kinds of model architectures run faster and better on DLA compared to the GPU?

When generating the DLA engine from your model, I got the “WARNING: [TRT]: Layer ‘Resize__76’ (RESIZE): Unsupported on DLA. Switching this layer’s device type to GPU.” log from TensorRT. That means the model is not completely DLA compatible; some layers may be built to run on the GPU rather than the DLA. The switch between DLA layers and GPU layers may introduce extra overhead during inferencing.
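
If it helps, you can check before building the engine which layers the builder would move off the DLA. The following is only a rough sketch against the TensorRT 8.x C++ API (the Logger class is a minimal placeholder, and I am assuming the unzipped model is named model_nchw.onnx):

#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <iostream>

// Minimal logger required by the TensorRT builder/parser APIs.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
};

int main()
{
    Logger logger;
    auto* builder = nvinfer1::createInferBuilder(logger);
    auto* network = builder->createNetworkV2(
        1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
    auto* parser = nvonnxparser::createParser(*network, logger);
    parser->parseFromFile("model_nchw.onnx",
                          static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));

    auto* config = builder->createBuilderConfig();
    config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
    config->setDLACore(0);
    config->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);

    // Print every layer that cannot be placed on the DLA core.
    for (int i = 0; i < network->getNbLayers(); ++i)
    {
        auto* layer = network->getLayer(i);
        if (!config->canRunOnDLA(layer))
            std::cout << "GPU fallback: " << layer->getName() << std::endl;
    }

    delete config;
    delete parser;
    delete network;
    delete builder;
    return 0;
}

Every layer reported there is a point where tensors have to move between the DLA and the GPU at runtime, which is the extra effort mentioned above.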

What do you mean by this? Gst-nvinfer includes preprocessing, inferencing, and postprocessing; why do you only count the inferencing time?

cudaEventSynchronize() waits until the completion of all work currently captured in the event.

The DLA latency is a part of the batch processing latency.

If you handle multiple streams, setting the batch size to the number of streams may help.

No.

What are you talking about?

Please refer to Working with DLA — NVIDIA TensorRT Documentation

The TensorRT forum may provide more information for your questions. TensorRT - NVIDIA Developer Forums


@Fiona.Chen Thank you for your answer. I will try to optimize the model for DLA so that no layers are switched to the GPU and everything stays on the DLA.
