GeMM performance on Orin DLA

I’ve been trying to offload some GeMM computations to the DLA.
However, I find that the DLA is almost 100x slower than the GPU for this operation.
Is this expected behaviour, or is something going wrong here?

Testing methodology
A 1024x1024 PyTorch nn.Linear layer was exported to ONNX and converted with trtexec.
The ONNX file and the trtexec output are also attached. Note: the ONNX file has a .txt extension because the forum wouldn’t let me upload a .onnx file.

trtexec_gemm_dla.txt (255.1 KB)
onnx_file.txt (4.0 MB)
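
For reference, a minimal sketch of the export step described above (file names, input shape, and trtexec flags are assumptions; the exact values are in the attached files):

# Sketch of the export described above; names and shapes are illustrative.
import torch
import torch.nn as nn

layer = nn.Linear(1024, 1024).eval()
dummy = torch.randn(1024, 1024)  # assumed: 1024 rows through the 1024x1024 layer

torch.onnx.export(
    layer,
    dummy,
    "gemm_1024.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)

# Engine build on the Orin, e.g. (flags are an assumption):
#   trtexec --onnx=gemm_1024.onnx --fp16 --useDLACore=0 --allowGPUFallback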

Testing Environment
Jetson AGX Orin 32GB Jetpack 5.1.1
Power Mode - 50W

Hello,

I think this topic should be posted in the Jetson forums. I have moved it over for better visibility.

Tom

Hi,

Which precision are you testing?
If it is not INT8, could you give INT8 a try?

The Orin DLA is designed for low-precision inference, so it offers much higher INT8 TOPS than FP16 FLOPS.
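
For example, an INT8 DLA engine can be built and timed with trtexec roughly as below (the file name is just a placeholder):

# Minimal sketch (assumed file name) of building and timing an INT8 DLA engine.
import subprocess

subprocess.run(
    [
        "trtexec",
        "--onnx=gemm_1024.onnx",  # assumed ONNX file
        "--int8",                 # build the engine with INT8 precision
        "--useDLACore=0",         # place supported layers on DLA core 0
        "--allowGPUFallback",     # let unsupported layers run on the GPU
    ],
    check=True,
)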

Thanks.

Hey,

Thanks for the reply.

So I tried INT8, and it is significantly (nearly 2x) faster than FP16 on the DLA, but it is still significantly slower than the GPU.

I’ve attached the relevant trtexec-generated runtime and conversion data.

GPU_INT8.txt (129.0 KB)
DLA_INT8.txt (125.6 KB)

Hi,

[11/28/2023-14:48:00] [I] [TRT] ---------- Layers Running on DLA ----------
[11/28/2023-14:48:00] [I] [TRT] [DlaLayer] {ForeignNode[/attn/qkv/Gemm]}
[11/28/2023-14:48:00] [I] [TRT] ---------- Layers Running on GPU ----------
[11/28/2023-14:48:00] [I] [TRT] [GpuLayer] SHUFFLE: reshape_before_/attn/qkv/Gemm
[11/28/2023-14:48:00] [I] [TRT] [GpuLayer] SHUFFLE: reshape_after_/attn/qkv/Gemm

In your model, some of the layers run on the DLA and some run on the GPU.
The data transfer between the DLA and the GPU can add extra overhead.

Is it possible to modify the network architecture so that all the layers run on the DLA?
You can find some examples in the GitHub repository below:
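
As a rough illustration (a sketch only, not taken from the repository above): one way that may keep everything on the DLA is to express the linear layer as a 1x1 convolution on a 4D input, so no Reshape/Transpose nodes are needed around it:

# Sketch (assumed shapes/names): the same 1024x1024 GEMM expressed as a 1x1
# convolution on a 4D NCHW tensor, so the exported graph has no reshapes.
import torch
import torch.nn as nn

conv = nn.Conv2d(1024, 1024, kernel_size=1).eval()
dummy = torch.randn(1, 1024, 32, 32)  # assumed: 32x32 = 1024 spatial positions

torch.onnx.export(
    conv,
    dummy,
    "gemm_as_conv1x1.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)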

Thanks.

Hey,

According to the layer-wise profile data, those reshapes do not seem to contribute much to the timings. I’ve attached the profile dumps.

DLA_INT8_PROFILE.txt (8.9 KB)
GPU_INT8_PROFILE.txt (8.7 KB)
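
For anyone who wants to reproduce this, a per-layer profile can be dumped with trtexec roughly as below (file names are placeholders):

# Sketch (assumed file names) of dumping a per-layer profile with trtexec.
import subprocess

subprocess.run(
    [
        "trtexec",
        "--onnx=gemm_1024.onnx",         # assumed ONNX file
        "--int8",
        "--useDLACore=0",
        "--allowGPUFallback",
        "--dumpProfile",                 # print per-layer timings
        "--separateProfileRun",          # profile in a separate pass from timing
        "--exportProfile=profile.json",  # also write the profile to a file
    ],
    check=True,
)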

Thanks.

Hi,

We need to check with our internal team.
Will share more info with you later.

Thanks.

Hi, any updates on this?

I ran some more tests to see whether the behaviour is consistent, and it seems to be.
Here are the results of those tests.
TRTEXEC_TIMINGS.txt (9.6 MB)

The data in tabular form

The ONNX files used to generate the above data

Hi,

Our internal team is still checking this issue.
Due to limited resources, it is expected to take some time to get further info.

But we recently had a new JetPack release.
Would you mind giving it a try?

It’s JetPack 6.0DP with DLA 3.14.

Thanks.


Hi,

Here are some possible factors for why the DLA throughput is much lower than the GPU's:

1. The 2D GEMM (NxC, CxK) here gets translated to a 4D 1x1 convolution (NxCx1x1, KxCx1x1)
→ The single Conv node is surrounded by Reshapes/Transposes

2. The DLA has lower available DRAM bandwidth than the GPU.

3. For multi-batched convolution, the performance sweet spot is a batch size of <16 rather than the bs=1024 used here.

A much better performance benchmark for the DLA would be running, for example, 3x3 convolutions with input and output channels that are multiples of 64, and a sufficiently large input resolution.
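
A minimal sketch of such a benchmark model (channel counts, resolution, and file names are just examples):

# Sketch of a DLA-friendly benchmark: a 3x3 convolution with channel counts
# that are multiples of 64 and a reasonably large input resolution.
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1).eval()
dummy = torch.randn(1, 64, 512, 512)  # assumed input resolution

torch.onnx.export(
    conv,
    dummy,
    "dla_conv_benchmark.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)

# Then, for example:
#   trtexec --onnx=dla_conv_benchmark.onnx --int8 --useDLACore=0 --allowGPUFallback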

For the FP16 vs. INT8 performance difference, please check the description below as well:

Thanks.