GeMM performance on Orin DLA

I’ve been trying to offload some GeMM computations to the DLA.
However, I find that the DLA is almost 100x slower than the GPU for this operation.
Is this expected behaviour, or is something going wrong here?

Testing methodology
A 1024x1024 PyTorch nn.Linear layer was exported to ONNX and converted with trtexec.
The ONNX file and the trtexec output are also attached. Note: the ONNX file has a .txt extension because the forum wouldn’t let me upload a .onnx file.

trtexec_gemm_dla.txt (255.1 KB)
onnx_file.txt (4.0 MB)
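
For reference, a minimal sketch of the export step described above (file names, input shape, and trtexec flags are assumptions; the exact values are in the attached files):

# Sketch of the export described above; names and shapes are illustrative.
import torch
import torch.nn as nn

layer = nn.Linear(1024, 1024).eval()
dummy = torch.randn(1024, 1024)  # assumed: 1024 rows through the 1024x1024 layer

torch.onnx.export(
    layer,
    dummy,
    "gemm_1024.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)

# Engine build on the Orin, e.g. (flags are an assumption):
#   trtexec --onnx=gemm_1024.onnx --fp16 --useDLACore=0 --allowGPUFallback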

Testing Environment
Jetson AGX Orin 32GB Jetpack 5.1.1
Power Mode - 50W

Hello,

I think this topic should be posted in the Jetson forums. I have moved it over for better visibility.

Tom

Hi,

Which precision are you testing?
If it is not INT8, could you give INT8 a try?

The Orin DLA is designed for low-precision inference, so it offers much higher INT8 TOPS than FP16 FLOPS.
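
For example, an INT8 DLA engine can be built and timed with trtexec roughly as below (the file name is just a placeholder):

# Minimal sketch (assumed file name) of building and timing an INT8 DLA engine.
import subprocess

subprocess.run(
    [
        "trtexec",
        "--onnx=gemm_1024.onnx",  # assumed ONNX file
        "--int8",                 # build the engine with INT8 precision
        "--useDLACore=0",         # place supported layers on DLA core 0
        "--allowGPUFallback",     # let unsupported layers run on the GPU
    ],
    check=True,
)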

Thanks.

Hey,

Thanks for the reply.

So I tried INT8, and it is significantly (nearly 2x) faster than FP16 on the DLA, but it is still significantly slower than the GPU.

I’ve attached the relevant trtexec-generated runtime and conversion data.

GPU_INT8.txt (129.0 KB)
DLA_INT8.txt (125.6 KB)

Hi,

[11/28/2023-14:48:00] [I] [TRT] ---------- Layers Running on DLA ----------
[11/28/2023-14:48:00] [I] [TRT] [DlaLayer] {ForeignNode[/attn/qkv/Gemm]}
[11/28/2023-14:48:00] [I] [TRT] ---------- Layers Running on GPU ----------
[11/28/2023-14:48:00] [I] [TRT] [GpuLayer] SHUFFLE: reshape_before_/attn/qkv/Gemm
[11/28/2023-14:48:00] [I] [TRT] [GpuLayer] SHUFFLE: reshape_after_/attn/qkv/Gemm

In your model, some of the layers run on the DLA and some run on the GPU.
The data transfer between the DLA and the GPU can add extra overhead.

Is it possible to modify the network architecture so that all the layers run on the DLA?
You can find some examples in the GitHub repository below:
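
As a rough illustration (a sketch only, not taken from the repository above): one way that may keep everything on the DLA is to express the linear layer as a 1x1 convolution on a 4D input, so no Reshape/Transpose nodes are needed around it:

# Sketch (assumed shapes/names): the same 1024x1024 GEMM expressed as a 1x1
# convolution on a 4D NCHW tensor, so the exported graph has no reshapes.
import torch
import torch.nn as nn

conv = nn.Conv2d(1024, 1024, kernel_size=1).eval()
dummy = torch.randn(1, 1024, 32, 32)  # assumed: 32x32 = 1024 spatial positions

torch.onnx.export(
    conv,
    dummy,
    "gemm_as_conv1x1.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)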

Thanks.

Hey,

According to the layer-wise profile data, those reshapes do not seem to contribute much to the timings. I’ve attached the profile dumps.

DLA_INT8_PROFILE.txt (8.9 KB)
GPU_INT8_PROFILE.txt (8.7 KB)
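
For anyone who wants to reproduce this, a per-layer profile can be dumped with trtexec roughly as below (file names are placeholders):

# Sketch (assumed file names) of dumping a per-layer profile with trtexec.
import subprocess

subprocess.run(
    [
        "trtexec",
        "--onnx=gemm_1024.onnx",         # assumed ONNX file
        "--int8",
        "--useDLACore=0",
        "--allowGPUFallback",
        "--dumpProfile",                 # print per-layer timings
        "--separateProfileRun",          # profile in a separate pass from timing
        "--exportProfile=profile.json",  # also write the profile to a file
    ],
    check=True,
)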

Thanks.

Hi,

We need to check with our internal team.
Will share more info with you later.

Thanks.

Hi, any updates on this?

I ran some more tests to see whether the behaviour is consistent, and it seems to be.
Here are the results of those tests.
TRTEXEC_TIMINGS.txt (9.6 MB)

The data in tabular form

The ONNX files used to generate the above data

Hi,

Our internal team is still checking this issue.
Due to limited resources, it is expected to take some time to get further info.

But we recently had a new JetPack release.
Would you mind giving it a try?

It’s JetPack 6.0DP with DLA 3.14.

Thanks.


Hi,

Here are some possible factors for why the DLA throughput is much lower than the GPU's:

1. The 2D GEMM (NxC, CxK) here gets translated to a 4D 1x1 convolution (NxCx1x1, KxCx1x1)
→ The single Conv node is surrounded by Reshapes/Transposes

2. The DLA has lower available DRAM bandwidth than the GPU.

3. For multi-batched convolution, the performance sweet spot is a batch size of <16 rather than the bs=1024 used here.

A much better performance benchmark for the DLA would be running, for example, 3x3 convolutions with input and output channels that are multiples of 64, and a sufficiently large input resolution.
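
A minimal sketch of such a benchmark model (channel counts, resolution, and file names are just examples):

# Sketch of a DLA-friendly benchmark: a 3x3 convolution with channel counts
# that are multiples of 64 and a reasonably large input resolution.
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1).eval()
dummy = torch.randn(1, 64, 512, 512)  # assumed input resolution

torch.onnx.export(
    conv,
    dummy,
    "dla_conv_benchmark.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)

# Then, for example:
#   trtexec --onnx=dla_conv_benchmark.onnx --int8 --useDLACore=0 --allowGPUFallback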

For the FP16 vs. INT8 performance difference, please check the description below as well:

Thanks.