Please provide complete information as applicable to your setup.
• Hardware Platform (Jetson / GPU)
• DeepStream Version
• JetPack Version (valid for Jetson only)
• TensorRT Version
• NVIDIA GPU Driver Version (valid for GPU only)
Why don’t you tell us the following information?
• Hardware Platform (Jetson / GPU)
• DeepStream Version
• JetPack Version (valid for Jetson only)
• TensorRT Version
• NVIDIA GPU Driver Version (valid for GPU only)
@Fiona.Chen I’m so sorry, I hadn’t noticed that I did not include it. Here is the information:
• Hardware Platform (Jetson / GPU): NVIDIA Jetson AGX Orin
• DeepStream Version: 7.1
• JetPack Version (valid for Jetson only): 6.1
• TensorRT Version: 8.6.2.3
• Issue Type (questions, new requirements, bugs): question
@Fiona.Chen Thank you very much for testing this. However, I do not understand why DLA is slower than GPU. Isn’t it supposed to be faster, or is that not always the case?
Here are the questions I asked at the beginning:

Questions for Optimization
Why does Gst-nvinfer take so much longer on DLA despite its faster inference time?
Is there a way to reduce synchronization overhead (cudaEventSynchronize())?
Does memory transfer from DLA to GPU introduce extra latency? If so, can it be optimized using pinned memory or asynchronous copies? (See the sketch after this list for what I mean.)
Could DeepStream’s queueing mechanism be affecting DLA latency? Would increasing batch size help mitigate this?
Are there best practices for optimizing dequeueOutputAndAttachMeta() when using DLA?
Could it be because the model architecture is not optimized for DLA? What kinds of model architectures run faster and better on DLA compared to GPU?
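To clarify what I mean by pinned memory and asynchronous copies in question 3, here is a minimal CUDA sketch. The buffer size, names, and stream are made up for illustration and are not from my actual pipeline:

```cpp
// Minimal sketch: copy an inference output from device to host using pinned
// (page-locked) host memory and cudaMemcpyAsync on a dedicated stream, so the
// copy does not have to go through a blocking pageable-memory transfer.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t numBytes = 1 << 20;           // hypothetical output size (1 MiB)

    float* dOutput = nullptr;                  // stands in for the model output buffer
    cudaMalloc(&dOutput, numBytes);
    cudaMemset(dOutput, 0, numBytes);

    float* hOutput = nullptr;                  // pinned host buffer for fast async copies
    cudaHostAlloc(&hOutput, numBytes, cudaHostAllocDefault);

    cudaStream_t copyStream;
    cudaStreamCreate(&copyStream);

    // ... inference would fill dOutput here ...

    // Asynchronous device-to-host copy; returns immediately on the host side.
    cudaMemcpyAsync(hOutput, dOutput, numBytes, cudaMemcpyDeviceToHost, copyStream);

    // Synchronize only when the host actually needs the data.
    cudaStreamSynchronize(copyStream);
    printf("first value: %f\n", hOutput[0]);

    cudaStreamDestroy(copyStream);
    cudaFreeHost(hOutput);
    cudaFree(dOutput);
    return 0;
}
```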
When generating the DLA engine from your model, I got the “WARNING: [TRT]: Layer ‘Resize__76’ (RESIZE): Unsupported on DLA. Switching this layer’s device type to GPU.” log from TensorRT. That means the model is not completely DLA compatible; some layers may be built to run on the GPU rather than on the DLA. The switching between DLA layers and GPU layers may introduce extra overhead during inference.
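For reference, a minimal sketch of how such an engine could be built with the TensorRT C++ API (assuming TensorRT 8.x and a placeholder model.onnx path, not your actual model), targeting the DLA while allowing unsupported layers like that Resize to fall back to the GPU. The builder log then shows exactly which layers cause device switches:

```cpp
// Minimal sketch: build a TensorRT engine that prefers DLA and allows GPU fallback.
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <fstream>
#include <iostream>
#include <memory>

class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
};

int main() {
    Logger logger;
    auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
    auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(
        builder->createNetworkV2(1U << static_cast<uint32_t>(
            nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH)));
    auto parser = std::unique_ptr<nvonnxparser::IParser>(
        nvonnxparser::createParser(*network, logger));
    // "model.onnx" is a placeholder path for this sketch.
    if (!parser->parseFromFile("model.onnx",
                               static_cast<int>(nvinfer1::ILogger::Severity::kWARNING))) {
        std::cerr << "Failed to parse ONNX model" << std::endl;
        return 1;
    }

    auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA); // run on DLA where possible
    config->setDLACore(0);                                    // AGX Orin exposes DLA cores 0 and 1
    config->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);    // let unsupported layers run on GPU
    config->setFlag(nvinfer1::BuilderFlag::kFP16);            // DLA requires FP16 or INT8

    auto serialized = std::unique_ptr<nvinfer1::IHostMemory>(
        builder->buildSerializedNetwork(*network, *config));
    if (!serialized) { std::cerr << "Engine build failed" << std::endl; return 1; }

    std::ofstream out("model_dla.engine", std::ios::binary);
    out.write(static_cast<const char*>(serialized->data()), serialized->size());
    return 0;
}
```

(Linking against nvinfer and nvonnxparser is assumed; the equivalent trtexec flags are --useDLACore and --allowGPUFallback.)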
What do you mean by this? gst-nvinfer includes preprocessing, inference, and postprocessing, so why do you only count the inference time?
cudaEventSynchronize() waits until the completion of all work currently captured in the event.
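A minimal sketch of that behavior (a hypothetical kernel, not DeepStream or nvinfer code): the event marks a point in the stream, and cudaEventSynchronize() blocks the host thread until all work recorded before that point has completed.

```cpp
// Minimal sketch: cudaEventSynchronize() blocks the calling host thread until
// all work captured before the event on its stream has finished.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;       // placeholder workload
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                    // capture point before the work
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);                     // capture point after the work

    cudaEventSynchronize(stop);                // host waits here until the kernel is done

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```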
The DLA latency is a part of the batch processing latency.
If you handle multiple streams, setting the batch size to the same value as the number of streams may help.
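As an illustration only, assuming a hypothetical 4-stream setup rather than your actual configuration, matching the muxer and nvinfer batch size to the number of sources in a deepstream-app style configuration would look roughly like this:

```
# Hypothetical deepstream-app config fragment for 4 input streams.
[streammux]
batch-size=4
batched-push-timeout=40000

# In the corresponding nvinfer (primary GIE) config file:
[property]
batch-size=4
```

The engine also needs to be built with a matching maximum batch size, otherwise nvinfer will rebuild it at startup.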
@Fiona.Chen Thank you for your answer. I will try to optimize the model for DLA so that no layers are switched to the GPU and everything stays on the DLA.