Jetson Orin: All layers pushed to GPU, zero layers on DLA


I am converting a model from PyTorch → ONNX → TensorRT.
The model consists of simple conv layers followed by ReLU. I am using trtexec to build the TRT engine, and
it reports that all layers are placed on the GPU:

[03/07/2023-09:53:32] [I] [TRT] ---------- Layers Running on DLA ----------
[03/07/2023-09:53:32] [I] [TRT] ---------- Layers Running on GPU ----------
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] COPY: Reformatting CopyNode for Network Input input
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_0 + Relu_1
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_2 + Relu_3
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_4 + Relu_5
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_7 + Relu_8
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] COPY: (Unnamed Layer* 9) [Identity]
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] POOLING: AveragePool_10
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_11 + Relu_12
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_13 + Relu_14
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_15 + Relu_16
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_18 + Relu_19
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] COPY: (Unnamed Layer* 20) [Identity]
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] POOLING: AveragePool_21
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_22 + Relu_23
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_24 + Relu_25
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_26 + Relu_27
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_29 + Relu_30
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] COPY: (Unnamed Layer* 31) [Identity]
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] POOLING: AveragePool_32
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_33 + Relu_34
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_35 + Relu_36
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_37 + Relu_38
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_40 + Relu_41
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] COPY: (Unnamed Layer* 42) [Identity]
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] POOLING: AveragePool_43
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_44 + Relu_45
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_46 + Relu_47
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_48 + Relu_49
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_51 + Relu_52
[03/07/2023-09:53:32] [I] [TRT] [GpuLayer] DECONVOLUTION: ConvTranspose_53 + BatchNormalization_54 + Relu_55

Why is this the case? How can I control layer placement so that some of the computation runs on the DLA?

The input shape is [1, 1, 48, 512], i.e. batch size = 1.

The network in PyTorch:

def __init__(self,
             in_channels: int,
             out_channels: int,
             kernel_long: Tuple = (7, 3),
             kernel_square: int = 3) -> None:
    super(NSKBlock, self).__init__()

    self.branch1 = BasicConv2d(in_channels, out_channels,
                               kernel_size=kernel_long, padding=(2, 0))
    # NOTE: branch2's remaining arguments were truncated in the original post;
    # kernel_size=kernel_square with padding=1 is an assumption from the parameter names.
    self.branch2 = BasicConv2d(in_channels, out_channels,
                               kernel_size=kernel_square, padding=1)
    self.branch3 = BasicConv2d(in_channels, out_channels,
                               kernel_size=kernel_long[::-1], padding=(0, 2))
    self.conv = BasicConv2d(out_channels * 3, out_channels, kernel_size=1, padding=1)


Just want to confirm whether you are using a Jetson Orin.
If yes, we will move your topic to the Orin board.

DLA only supports FP16/INT8 precision; did you build your model with one of these precision modes?

Yes, I use a Jetson AGX Orin.

I added the --fp16 flag to trtexec; it now reports mixed precision (FP16 + FP32), but all layers are still placed on the GPU, zero on the DLA…

I tried both --fp16 and --int8, and in both cases the result is the same: still nothing on the DLA.

What else can be done @AastaLLL ?

When I run

./trtexec --onnx=pretrained.onnx --saveEngine=/engine_fp16.trt --useDLACore=0 --verbose

I get the following error:

[03/08/2023-08:19:08] [E] Error[4]: [network.cpp::validate::2789] Error Code 4: Internal Error (DLA validation failed)


The only command that pushes anything onto the DLA is:

./trtexec --onnx=pretrained.onnx --saveEngine=engine_fp16.trt --int8 --fp16 --best --useDLACore=0 --allowGPUFallback

However, the conv layers still run on the GPU:

[03/08/2023-08:48:47] [I] [TRT] ---------- Layers Running on DLA ----------
[03/08/2023-08:48:47] [I] [TRT] [DlaLayer] {ForeignNode[Transpose36]}
[03/08/2023-08:48:47] [I] [TRT] [DlaLayer] {ForeignNode[batch_normalization_1…Relu9]}
[03/08/2023-08:48:47] [I] [TRT] [DlaLayer] {ForeignNode[batch_normalization_2…Relu24]}
[03/08/2023-08:48:47] [I] [TRT] [DlaLayer] {ForeignNode[Relu23]}
[03/08/2023-08:48:47] [I] [TRT] [DlaLayer] {ForeignNode[batch_normalization_13…concatenate_1]}
[03/08/2023-08:48:47] [I] [TRT] [DlaLayer] {ForeignNode[Relu19…Relu18]}
[03/08/2023-08:48:47] [I] [TRT] [DlaLayer] {ForeignNode[batch_normalization_16…concatenate_2]}
[03/08/2023-08:48:47] [I] [TRT] [DlaLayer] {ForeignNode[Relu14…Relu13]}
[03/08/2023-08:48:47] [I] [TRT] [DlaLayer] {ForeignNode[batch_normalization_19…Relu6]}
[03/08/2023-08:48:47] [I] [TRT] [DlaLayer] {ForeignNode[batch_normalization_22…PushTranspose_1]}
[03/08/2023-08:48:47] [I] [TRT] ---------- Layers Running on GPU ----------
[03/08/2023-08:48:47] [I] [TRT] [GpuLayer] DECONVOLUTION: conv2d_transpose_1
[03/08/2023-08:48:47] [I] [TRT] [GpuLayer] DECONVOLUTION: conv2d_transpose_2
[03/08/2023-08:48:47] [I] [TRT] [GpuLayer] CONVOLUTION: conv2d_10
[03/08/2023-08:48:47] [I] [TRT] [GpuLayer] DECONVOLUTION: conv2d_transpose_3
[03/08/2023-08:48:47] [I] [TRT] [GpuLayer] CONVOLUTION: conv2d_11
[03/08/2023-08:48:47] [I] [TRT] [GpuLayer] DECONVOLUTION: conv2d_transpose_4
[03/08/2023-08:48:47] [I] [TRT] [GpuLayer] CONVOLUTION: conv2d_13
[03/08/2023-08:48:47] [I] [TRT] [GpuLayer] DECONVOLUTION: conv2d_transpose_5
[03/08/2023-08:48:47] [I] [TRT] [GpuLayer] DECONVOLUTION: conv2d_transpose_6

Further update:

With the above engine, the performance of the generative model has degraded significantly. Any idea why this could be, given that only batch norm, ReLU, and transpose layers were pushed onto the DLA?


There are some constraints on the DLA convolution layer.
Please check the document below:
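As an illustration (not the authoritative list — the ranges below are taken from the TensorRT "DLA Supported Layers" documentation and vary by release, so verify against your version), a small script can pre-check a convolution's hyperparameters against a few of the documented DLA restrictions:

```python
# Hedged sketch: check a conv layer's hyperparameters against a few DLA
# restrictions listed in the TensorRT "DLA Supported Layers" docs.
# The exact ranges differ between TensorRT releases -- verify against yours.
def dla_conv_ok(kernel, stride, padding, dilation=(1, 1)):
    checks = [
        all(1 <= k <= 32 for k in kernel),    # kernel size per dimension
        all(1 <= s <= 8 for s in stride),     # stride per dimension
        all(0 <= p <= 31 for p in padding),   # padding per dimension
        all(1 <= d <= 32 for d in dilation),  # dilation per dimension
    ]
    return all(checks)

# The (7, 3) / (3, 7) kernels from the network above fall within these ranges:
print(dla_conv_ok(kernel=(7, 3), stride=(1, 1), padding=(2, 0)))    # True
print(dla_conv_ok(kernel=(64, 3), stride=(1, 1), padding=(1, 1)))   # False: kernel dim > 32
```

Note that deconvolution (ConvTranspose) layers have stricter DLA rules than regular convolutions, which may explain why the `conv2d_transpose_*` layers above fall back to the GPU.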

When an inference job keeps switching between the GPU and the DLA, the overhead of moving data between the two can slow down overall performance.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Check out the DLA GitHub page for samples and resources: recipes and tools for running deep learning workloads on NVIDIA DLA cores for inference applications.

We have an FAQ page that addresses some common questions we see developers run into: Deep-Learning-Accelerator-SW/FAQ