Failure to run a conv layer on DLA

Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.4 SDK
other

Target Operating System
Linux
QNX
other

Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure of its part number)
other

SDK Manager Version
1.8.3.10426
other

Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other

I am trying to figure out which kinds of conv can run on the DLA.
I referred to 12.2. DLA Supported Layers and Restrictions and created two simple models, each containing only a single conv op, as follows.

[09/28/2022-12:16:50] [V] [TRT] Adding network input: onnx::Conv_0 with dtype: float32, dimensions: (1, 1706, 3, 3)
[09/28/2022-12:16:50] [V] [TRT] Registering tensor: onnx::Conv_0 for ONNX tensor: onnx::Conv_0
[09/28/2022-12:16:50] [V] [TRT] Parsing node: Constant_0 [Constant]
[09/28/2022-12:16:50] [V] [TRT] Constant_0 [Constant] inputs: 
[09/28/2022-12:16:50] [V] [TRT] Constant_0 [Constant] outputs: [onnx::Conv_1 -> (3, 1706, 3, 3)[FLOAT]], 
[09/28/2022-12:16:50] [V] [TRT] Parsing node: Constant_1 [Constant]
[09/28/2022-12:16:50] [V] [TRT] Constant_1 [Constant] inputs: 
[09/28/2022-12:16:50] [V] [TRT] Constant_1 [Constant] outputs: [onnx::Conv_2 -> (3)[FLOAT]], 
[09/28/2022-12:16:50] [V] [TRT] Parsing node: Conv_2 [Conv]
[09/28/2022-12:16:50] [V] [TRT] Searching for input: onnx::Conv_0
[09/28/2022-12:16:50] [V] [TRT] Searching for input: onnx::Conv_1
[09/28/2022-12:16:50] [V] [TRT] Searching for input: onnx::Conv_2
[09/28/2022-12:16:50] [V] [TRT] Conv_2 [Conv] inputs: [onnx::Conv_0 -> (1, 1706, 3, 3)[FLOAT]], [onnx::Conv_1 -> (3, 1706, 3, 3)[FLOAT]], [onnx::Conv_2 -> (3)[FLOAT]], 
[09/28/2022-12:16:50] [V] [TRT] Convolution input dimensions: (1, 1706, 3, 3)
[09/28/2022-12:16:50] [V] [TRT] Registering layer: Conv_2 for ONNX node: Conv_2
[09/28/2022-12:16:50] [V] [TRT] Using kernel: (3, 3), strides: (1, 1), prepadding: (0, 0), postpadding: (0, 0), dilations: (1, 1), numOutputs: 3
[09/28/2022-12:16:50] [V] [TRT] Convolution output dimensions: (1, 3, 1, 1)
[09/28/2022-12:16:50] [V] [TRT] Registering tensor: 3_0 for ONNX tensor: 3
[09/28/2022-12:16:50] [V] [TRT] Conv_2 [Conv] outputs: [3 -> (1, 3, 1, 1)[FLOAT]], 
[09/28/2022-12:16:50] [V] [TRT] Marking 3_0 as output: 3
[09/28/2022-12:16:50] [I] Finish parsing network model
[09/28/2022-12:16:50] [V] [TRT] Applying generic optimizations to the graph for inference.
[09/28/2022-12:16:50] [V] [TRT] Original: 1 layers
[09/28/2022-12:16:50] [V] [TRT] After dead-layer removal: 1 layers
[09/28/2022-12:16:50] [V] [TRT] After Myelin optimization: 1 layers
[09/28/2022-12:16:50] [V] [TRT] {ForeignNode[Conv_2]} successfully offloaded to DLA.
[09/28/2022-12:16:50] [V] [TRT] Memory consumption details:
[09/28/2022-12:16:50] [V] [TRT] 	Pool Sizes: Managed SRAM = 0.5 MiB,	Local DRAM = 1024 MiB,	Global DRAM = 512 MiB
[09/28/2022-12:16:50] [V] [TRT] 	Required: Managed SRAM = 0.5 MiB,	Local DRAM = 2 MiB,	Global DRAM = 4 MiB
[09/28/2022-12:16:50] [V] [TRT] DLA Memory Consumption Summary:
[09/28/2022-12:16:50] [V] [TRT] 	Number of DLA node candidates offloaded : 1 out of 1
[09/28/2022-12:16:50] [V] [TRT] 	Total memory required by accepted candidates : Managed SRAM = 0.5 MiB,	Local DRAM = 2 MiB,	Global DRAM = 4 MiB
[09/28/2022-12:16:50] [V] [TRT] After DLA optimization: 3 layers
[09/28/2022-12:16:50] [V] [TRT] Applying ScaleNodes fusions.
[09/28/2022-12:16:50] [V] [TRT] After scale fusion: 3 layers
[09/28/2022-12:16:50] [V] [TRT] After dupe layer removal: 3 layers
[09/28/2022-12:16:50] [V] [TRT] After final dead-layer removal: 3 layers
[09/28/2022-12:16:50] [V] [TRT] After tensor merging: 3 layers
[09/28/2022-12:16:50] [V] [TRT] After vertical fusions: 3 layers
[09/28/2022-12:16:50] [V] [TRT] After dupe layer removal: 3 layers
[09/28/2022-12:16:50] [V] [TRT] After final dead-layer removal: 3 layers
[09/28/2022-12:16:50] [V] [TRT] After tensor merging: 3 layers
[09/28/2022-12:16:50] [V] [TRT] After slice removal: 3 layers
[09/28/2022-12:16:50] [V] [TRT] After concat removal: 3 layers
[09/28/2022-12:16:50] [V] [TRT] Trying to split Reshape and strided tensor
[09/28/2022-12:16:50] [V] [TRT] Graph construction and optimization completed in 0.0152455 seconds.
[09/28/2022-12:16:50] [I] [TRT] ---------- Layers Running on DLA ----------
[09/28/2022-12:16:50] [I] [TRT] [DlaLayer] {ForeignNode[Conv_2]}
[09/28/2022-12:16:50] [I] [TRT] ---------- Layers Running on GPU ----------

[09/28/2022-12:49:31] [V] [TRT] Adding network input: onnx::Conv_0 with dtype: float32, dimensions: (1, 1707, 3, 3)
[09/28/2022-12:49:31] [V] [TRT] Registering tensor: onnx::Conv_0 for ONNX tensor: onnx::Conv_0
[09/28/2022-12:49:31] [V] [TRT] Parsing node: Constant_0 [Constant]
[09/28/2022-12:49:31] [V] [TRT] Constant_0 [Constant] inputs: 
[09/28/2022-12:49:31] [V] [TRT] Constant_0 [Constant] outputs: [onnx::Conv_1 -> (3, 1707, 3, 3)[FLOAT]], 
[09/28/2022-12:49:31] [V] [TRT] Parsing node: Constant_1 [Constant]
[09/28/2022-12:49:31] [V] [TRT] Constant_1 [Constant] inputs: 
[09/28/2022-12:49:31] [V] [TRT] Constant_1 [Constant] outputs: [onnx::Conv_2 -> (3)[FLOAT]], 
[09/28/2022-12:49:31] [V] [TRT] Parsing node: Conv_2 [Conv]
[09/28/2022-12:49:31] [V] [TRT] Searching for input: onnx::Conv_0
[09/28/2022-12:49:31] [V] [TRT] Searching for input: onnx::Conv_1
[09/28/2022-12:49:31] [V] [TRT] Searching for input: onnx::Conv_2
[09/28/2022-12:49:31] [V] [TRT] Conv_2 [Conv] inputs: [onnx::Conv_0 -> (1, 1707, 3, 3)[FLOAT]], [onnx::Conv_1 -> (3, 1707, 3, 3)[FLOAT]], [onnx::Conv_2 -> (3)[FLOAT]], 
[09/28/2022-12:49:31] [V] [TRT] Convolution input dimensions: (1, 1707, 3, 3)
[09/28/2022-12:49:31] [V] [TRT] Registering layer: Conv_2 for ONNX node: Conv_2
[09/28/2022-12:49:31] [V] [TRT] Using kernel: (3, 3), strides: (1, 1), prepadding: (0, 0), postpadding: (0, 0), dilations: (1, 1), numOutputs: 3
[09/28/2022-12:49:31] [V] [TRT] Convolution output dimensions: (1, 3, 1, 1)
[09/28/2022-12:49:31] [V] [TRT] Registering tensor: 3_0 for ONNX tensor: 3
[09/28/2022-12:49:31] [V] [TRT] Conv_2 [Conv] outputs: [3 -> (1, 3, 1, 1)[FLOAT]], 
[09/28/2022-12:49:31] [V] [TRT] Marking 3_0 as output: 3
[09/28/2022-12:49:31] [I] Finish parsing network model
[09/28/2022-12:49:31] [V] [TRT] Applying generic optimizations to the graph for inference.
[09/28/2022-12:49:31] [V] [TRT] Original: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After dead-layer removal: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After Myelin optimization: 1 layers
[09/28/2022-12:49:31] [W] [TRT] Validation failed for DLA layer: Conv_2. Switching to GPU fallback.
[09/28/2022-12:49:31] [V] [TRT] DLA Memory Consumption Summary:
[09/28/2022-12:49:31] [V] [TRT] 	Number of DLA node candidates offloaded : 0 out of 0
[09/28/2022-12:49:31] [V] [TRT] 	Total memory required by accepted candidates : Managed SRAM = 0 MiB,	Local DRAM = 0 MiB,	Global DRAM = 0 MiB
[09/28/2022-12:49:31] [V] [TRT] After DLA optimization: 1 layers
[09/28/2022-12:49:31] [V] [TRT] Applying ScaleNodes fusions.
[09/28/2022-12:49:31] [V] [TRT] After scale fusion: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After dupe layer removal: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After final dead-layer removal: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After tensor merging: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After vertical fusions: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After dupe layer removal: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After final dead-layer removal: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After tensor merging: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After slice removal: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After concat removal: 1 layers
[09/28/2022-12:49:31] [V] [TRT] Trying to split Reshape and strided tensor
[09/28/2022-12:49:31] [V] [TRT] Graph construction and optimization completed in 0.000808394 seconds.
[09/28/2022-12:49:31] [I] [TRT] ---------- Layers Running on DLA ----------
[09/28/2022-12:49:31] [I] [TRT] ---------- Layers Running on GPU ----------
[09/28/2022-12:49:31] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_2

trtexec command:
trtexec --onnx=sample.onnx --fp16 --useDLACore=0 --allowGPUFallback --exportProfile=sample.dla.profile.json --exportLayerInfo=sample.dla.layerinfo.json --exportOutput=sample.dla.output.json --dumpLayerInfo --dumpProfile --profilingVerbosity=detailed --separateProfileRun --useSpinWait --useCudaGraph --saveEngine=sample.dla.engine --verbose

My simple question:
Why can a conv with input shape [1, 1706, 3, 3] run on the DLA while [1, 1707, 3, 3] cannot?
And which restriction have I hit?

May I know which HW platform you used?
And what kind of OS?

Thanks

Thanks for the reply. The following information was collected by jetson_stats.

  • NVIDIA Jetson UNKNOWN
    • Jetpack UNKNOWN [L4T 35.1.0]
    • NV Power Mode: MAXN - Type: 0
    • jetson_stats.service: active
  • Libraries:
    • CUDA: NOT_INSTALLED
    • cuDNN: 8.4.1.50
    • TensorRT: 8.4.1.5
    • Visionworks: NOT_INSTALLED
    • OpenCV: 4.5.4 compiled CUDA: NO
    • VPI: ii libnvvpi2 2.1.6 arm64 NVIDIA Vision Programming Interface library
    • Vulkan: 1.3.203
declare -x JETSON_BOARD="P3737-000"
declare -x JETSON_BOARDIDS=""
declare -x JETSON_CHIP_ID=""
declare -x JETSON_CODENAME="concord"
declare -x JETSON_CUDA="NOT_INSTALLED"
declare -x JETSON_CUDA_ARCH_BIN="NONE"
declare -x JETSON_CUDNN="8.4.1.50"
declare -x JETSON_JETPACK="UNKNOWN"
declare -x JETSON_L4T="35.1.0"
declare -x JETSON_L4T_RELEASE="35"
declare -x JETSON_L4T_REVISION="1.0"
declare -x JETSON_MACHINE="NVIDIA Jetson UNKNOWN"
declare -x JETSON_MODULE="UNKNOWN"
declare -x JETSON_OPENCV="4.5.4"
declare -x JETSON_OPENCV_CUDA="NO"
declare -x JETSON_SOC="tegra23x"
declare -x JETSON_TENSORRT="8.4.1.5"
declare -x JETSON_TYPE="UNKNOWN"
declare -x JETSON_VISIONWORKS="NOT_INSTALLED"
declare -x JETSON_VPI="ii libnvvpi2 2.1.6 arm64 NVIDIA Vision Programming Interface library"
declare -x JETSON_VULKAN_INFO="1.3.203"

Your topic was posted in the wrong category. I am moving this to the Jetson AGX Orin category for visibility.

Hi,

It’s possible that when the channel count increases to 1707, the required resources exceed the DLA’s capability.
Could you share the source for the ONNX model generation so we can check with our dev team for further information?

Thanks.

import torch
from torch import nn
from torch.nn import functional as F


C = 1707

x = torch.rand(1, C, 3, 3).cuda().half()
w = torch.rand(3, C, 3, 3).cuda().half()
b = torch.rand(3, ).cuda().half()


class Mod(nn.Module):
    def forward(self, x):
        return F.conv2d(x, w, b)


model = Mod()
torch.onnx.export(model, (x,), "sample.onnx")

Thanks.

I also checked whether these numbers meet the CBUF size requirement with the following code. The result was PASSED.

import numpy as np

# CBUF check following the formula in the public document
# (per-element multiplier: 1 byte for INT8, 2 bytes otherwise).
INT8 = False
inputDims_c = 1707
inputDims_w = 3
kernelSize_h = 3
kernelSize_w = 3
dilation_h = 1

bytesPerElem = 1 if INT8 else 2

entriesPerDataSlice = np.ceil(np.ceil(inputDims_c * bytesPerElem / 32.0) * inputDims_w /
                              4.0).astype(np.uint32)
dilatedKernelHt = (kernelSize_h - 1) * dilation_h + 1

# roundUp(x, 128): round x up to the nearest multiple of 128.
# (Note: np.round(x, 128) rounds to 128 *decimal places*, which is a no-op
# for integers and does not implement the document's roundUp.)
kernelBytes = inputDims_c * kernelSize_h * kernelSize_w * bytesPerElem
kernelBytesRounded = np.ceil(kernelBytes / 128.0) * 128

wtBanksForOneKernel = np.ceil(kernelBytesRounded / 32768.0).astype(np.uint32)
minDataBanks = np.ceil(float(entriesPerDataSlice * dilatedKernelHt) / 256.0).astype(np.uint32)

print((wtBanksForOneKernel + minDataBanks) <= 16)

Hi,

Do you mean the 1707 failure case is still within the CBUF constraint?
Thanks.

Yes. Actually, the 1707 failure case meets all the constraints from the open document.

Any updates here?

Hi,

Thanks for your patience.

We have confirmed that the same behavior is reproducible in our environment as well,
and we are now checking with our internal team.

Will share more information with you later.

Thanks.

Hi,

The layer cannot be deployed on the DLA because it exceeds the CBUF limit.

For c=1707:

Number of weight banks
        = ceil(roundUp(numChannels * kernelHeight * kernelWidth * 32, 128) / (cbufEntryWidth * cbufEntriesPerBank))
        = ceil(roundUp(1707 * 3 * 3 * 32, 128) / (128 * 256))
        = ceil(491648 / 32768)
        = 16

For c=1706:

Number of weight banks
        = ceil(roundUp(numChannels * kernelHeight * kernelWidth * 32, 128) / (cbufEntryWidth * cbufEntriesPerBank))
        = ceil(roundUp(1706 * 3 * 3 * 32, 128) / (128 * 256))
        = ceil(491392 / 32768)
        = 15

minDataBanks = 1, so:
[c=1707]: fails, since wtBanksForOneKernel + minDataBanks > 16
[c=1706]: passes, since wtBanksForOneKernel + minDataBanks <= 16
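For reference, the corrected bank calculation above can be sketched in Python. The constant names and the value MIN_DATA_BANKS = 1 are taken from this reply; the outer ceiling matches the bank-count check used earlier in this thread:

```python
import math

CBUF_ENTRY_WIDTH = 128       # bytes per CBUF entry
CBUF_ENTRIES_PER_BANK = 256  # entries per bank

def round_up(x, multiple):
    """Round x up to the nearest multiple of `multiple`."""
    return (x + multiple - 1) // multiple * multiple

def weight_banks(num_channels, kernel_h=3, kernel_w=3):
    # Corrected formula from this reply: the per-element multiplier is 32,
    # not the (INT8 ? 1 : 2) given in the public document.
    raw = num_channels * kernel_h * kernel_w * 32
    return math.ceil(round_up(raw, 128) / (CBUF_ENTRY_WIDTH * CBUF_ENTRIES_PER_BANK))

MIN_DATA_BANKS = 1  # from this reply

for c in (1706, 1707):
    wt = weight_banks(c)
    fits = wt + MIN_DATA_BANKS <= 16
    print(f"c={c}: weight banks={wt}, fits in CBUF: {fits}")
# c=1706: weight banks=15, fits in CBUF: True
# c=1707: weight banks=16, fits in CBUF: False
```

This reproduces the boundary seen in the logs: 1706 channels is the last configuration that leaves room for the one required data bank.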

The equation listed in our current document is not clear enough.
Our internal team will improve it soon.

Thanks.

Hi, thanks for your reply.

Number of weight banks
        = roundUp(numChannels * kernelHeight * kernelWidth * 32, 128) / (cbufEntryWidth * cbufEntriesPerBank)
        = roundUp(1707 * 3 * 3 * 32, 128) / (128 * 256)
        = 16


This formula differs from the one in the document.
Why do you multiply by 32?
In the document, the multiplier is (INT8 ? 1 : 2).

Hi,

Sorry, the formula currently in the document is incorrect.
It will be updated soon.

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.