Failure to run a conv layer on DLA

Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.4 SDK
other

Target Operating System
Linux
QNX
other

Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure of its part number)
other

SDK Manager Version
1.8.3.10426
other

Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other

I am trying to figure out which kinds of conv can run on the DLA.
I referred to 12.2. DLA Supported Layers and Restrictions and created two simple models, each containing only a single conv op, as follows.

[09/28/2022-12:16:50] [V] [TRT] Adding network input: onnx::Conv_0 with dtype: float32, dimensions: (1, 1706, 3, 3)
[09/28/2022-12:16:50] [V] [TRT] Registering tensor: onnx::Conv_0 for ONNX tensor: onnx::Conv_0
[09/28/2022-12:16:50] [V] [TRT] Parsing node: Constant_0 [Constant]
[09/28/2022-12:16:50] [V] [TRT] Constant_0 [Constant] inputs: 
[09/28/2022-12:16:50] [V] [TRT] Constant_0 [Constant] outputs: [onnx::Conv_1 -> (3, 1706, 3, 3)[FLOAT]], 
[09/28/2022-12:16:50] [V] [TRT] Parsing node: Constant_1 [Constant]
[09/28/2022-12:16:50] [V] [TRT] Constant_1 [Constant] inputs: 
[09/28/2022-12:16:50] [V] [TRT] Constant_1 [Constant] outputs: [onnx::Conv_2 -> (3)[FLOAT]], 
[09/28/2022-12:16:50] [V] [TRT] Parsing node: Conv_2 [Conv]
[09/28/2022-12:16:50] [V] [TRT] Searching for input: onnx::Conv_0
[09/28/2022-12:16:50] [V] [TRT] Searching for input: onnx::Conv_1
[09/28/2022-12:16:50] [V] [TRT] Searching for input: onnx::Conv_2
[09/28/2022-12:16:50] [V] [TRT] Conv_2 [Conv] inputs: [onnx::Conv_0 -> (1, 1706, 3, 3)[FLOAT]], [onnx::Conv_1 -> (3, 1706, 3, 3)[FLOAT]], [onnx::Conv_2 -> (3)[FLOAT]], 
[09/28/2022-12:16:50] [V] [TRT] Convolution input dimensions: (1, 1706, 3, 3)
[09/28/2022-12:16:50] [V] [TRT] Registering layer: Conv_2 for ONNX node: Conv_2
[09/28/2022-12:16:50] [V] [TRT] Using kernel: (3, 3), strides: (1, 1), prepadding: (0, 0), postpadding: (0, 0), dilations: (1, 1), numOutputs: 3
[09/28/2022-12:16:50] [V] [TRT] Convolution output dimensions: (1, 3, 1, 1)
[09/28/2022-12:16:50] [V] [TRT] Registering tensor: 3_0 for ONNX tensor: 3
[09/28/2022-12:16:50] [V] [TRT] Conv_2 [Conv] outputs: [3 -> (1, 3, 1, 1)[FLOAT]], 
[09/28/2022-12:16:50] [V] [TRT] Marking 3_0 as output: 3
[09/28/2022-12:16:50] [I] Finish parsing network model
[09/28/2022-12:16:50] [V] [TRT] Applying generic optimizations to the graph for inference.
[09/28/2022-12:16:50] [V] [TRT] Original: 1 layers
[09/28/2022-12:16:50] [V] [TRT] After dead-layer removal: 1 layers
[09/28/2022-12:16:50] [V] [TRT] After Myelin optimization: 1 layers
[09/28/2022-12:16:50] [V] [TRT] {ForeignNode[Conv_2]} successfully offloaded to DLA.
[09/28/2022-12:16:50] [V] [TRT] Memory consumption details:
[09/28/2022-12:16:50] [V] [TRT] 	Pool Sizes: Managed SRAM = 0.5 MiB,	Local DRAM = 1024 MiB,	Global DRAM = 512 MiB
[09/28/2022-12:16:50] [V] [TRT] 	Required: Managed SRAM = 0.5 MiB,	Local DRAM = 2 MiB,	Global DRAM = 4 MiB
[09/28/2022-12:16:50] [V] [TRT] DLA Memory Consumption Summary:
[09/28/2022-12:16:50] [V] [TRT] 	Number of DLA node candidates offloaded : 1 out of 1
[09/28/2022-12:16:50] [V] [TRT] 	Total memory required by accepted candidates : Managed SRAM = 0.5 MiB,	Local DRAM = 2 MiB,	Global DRAM = 4 MiB
[09/28/2022-12:16:50] [V] [TRT] After DLA optimization: 3 layers
[09/28/2022-12:16:50] [V] [TRT] Applying ScaleNodes fusions.
[09/28/2022-12:16:50] [V] [TRT] After scale fusion: 3 layers
[09/28/2022-12:16:50] [V] [TRT] After dupe layer removal: 3 layers
[09/28/2022-12:16:50] [V] [TRT] After final dead-layer removal: 3 layers
[09/28/2022-12:16:50] [V] [TRT] After tensor merging: 3 layers
[09/28/2022-12:16:50] [V] [TRT] After vertical fusions: 3 layers
[09/28/2022-12:16:50] [V] [TRT] After dupe layer removal: 3 layers
[09/28/2022-12:16:50] [V] [TRT] After final dead-layer removal: 3 layers
[09/28/2022-12:16:50] [V] [TRT] After tensor merging: 3 layers
[09/28/2022-12:16:50] [V] [TRT] After slice removal: 3 layers
[09/28/2022-12:16:50] [V] [TRT] After concat removal: 3 layers
[09/28/2022-12:16:50] [V] [TRT] Trying to split Reshape and strided tensor
[09/28/2022-12:16:50] [V] [TRT] Graph construction and optimization completed in 0.0152455 seconds.
[09/28/2022-12:16:50] [I] [TRT] ---------- Layers Running on DLA ----------
[09/28/2022-12:16:50] [I] [TRT] [DlaLayer] {ForeignNode[Conv_2]}
[09/28/2022-12:16:50] [I] [TRT] ---------- Layers Running on GPU ----------

[09/28/2022-12:49:31] [V] [TRT] Adding network input: onnx::Conv_0 with dtype: float32, dimensions: (1, 1707, 3, 3)
[09/28/2022-12:49:31] [V] [TRT] Registering tensor: onnx::Conv_0 for ONNX tensor: onnx::Conv_0
[09/28/2022-12:49:31] [V] [TRT] Parsing node: Constant_0 [Constant]
[09/28/2022-12:49:31] [V] [TRT] Constant_0 [Constant] inputs: 
[09/28/2022-12:49:31] [V] [TRT] Constant_0 [Constant] outputs: [onnx::Conv_1 -> (3, 1707, 3, 3)[FLOAT]], 
[09/28/2022-12:49:31] [V] [TRT] Parsing node: Constant_1 [Constant]
[09/28/2022-12:49:31] [V] [TRT] Constant_1 [Constant] inputs: 
[09/28/2022-12:49:31] [V] [TRT] Constant_1 [Constant] outputs: [onnx::Conv_2 -> (3)[FLOAT]], 
[09/28/2022-12:49:31] [V] [TRT] Parsing node: Conv_2 [Conv]
[09/28/2022-12:49:31] [V] [TRT] Searching for input: onnx::Conv_0
[09/28/2022-12:49:31] [V] [TRT] Searching for input: onnx::Conv_1
[09/28/2022-12:49:31] [V] [TRT] Searching for input: onnx::Conv_2
[09/28/2022-12:49:31] [V] [TRT] Conv_2 [Conv] inputs: [onnx::Conv_0 -> (1, 1707, 3, 3)[FLOAT]], [onnx::Conv_1 -> (3, 1707, 3, 3)[FLOAT]], [onnx::Conv_2 -> (3)[FLOAT]], 
[09/28/2022-12:49:31] [V] [TRT] Convolution input dimensions: (1, 1707, 3, 3)
[09/28/2022-12:49:31] [V] [TRT] Registering layer: Conv_2 for ONNX node: Conv_2
[09/28/2022-12:49:31] [V] [TRT] Using kernel: (3, 3), strides: (1, 1), prepadding: (0, 0), postpadding: (0, 0), dilations: (1, 1), numOutputs: 3
[09/28/2022-12:49:31] [V] [TRT] Convolution output dimensions: (1, 3, 1, 1)
[09/28/2022-12:49:31] [V] [TRT] Registering tensor: 3_0 for ONNX tensor: 3
[09/28/2022-12:49:31] [V] [TRT] Conv_2 [Conv] outputs: [3 -> (1, 3, 1, 1)[FLOAT]], 
[09/28/2022-12:49:31] [V] [TRT] Marking 3_0 as output: 3
[09/28/2022-12:49:31] [I] Finish parsing network model
[09/28/2022-12:49:31] [V] [TRT] Applying generic optimizations to the graph for inference.
[09/28/2022-12:49:31] [V] [TRT] Original: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After dead-layer removal: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After Myelin optimization: 1 layers
[09/28/2022-12:49:31] [W] [TRT] Validation failed for DLA layer: Conv_2. Switching to GPU fallback.
[09/28/2022-12:49:31] [V] [TRT] DLA Memory Consumption Summary:
[09/28/2022-12:49:31] [V] [TRT] 	Number of DLA node candidates offloaded : 0 out of 0
[09/28/2022-12:49:31] [V] [TRT] 	Total memory required by accepted candidates : Managed SRAM = 0 MiB,	Local DRAM = 0 MiB,	Global DRAM = 0 MiB
[09/28/2022-12:49:31] [V] [TRT] After DLA optimization: 1 layers
[09/28/2022-12:49:31] [V] [TRT] Applying ScaleNodes fusions.
[09/28/2022-12:49:31] [V] [TRT] After scale fusion: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After dupe layer removal: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After final dead-layer removal: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After tensor merging: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After vertical fusions: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After dupe layer removal: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After final dead-layer removal: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After tensor merging: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After slice removal: 1 layers
[09/28/2022-12:49:31] [V] [TRT] After concat removal: 1 layers
[09/28/2022-12:49:31] [V] [TRT] Trying to split Reshape and strided tensor
[09/28/2022-12:49:31] [V] [TRT] Graph construction and optimization completed in 0.000808394 seconds.
[09/28/2022-12:49:31] [I] [TRT] ---------- Layers Running on DLA ----------
[09/28/2022-12:49:31] [I] [TRT] ---------- Layers Running on GPU ----------
[09/28/2022-12:49:31] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_2

trtexec command:
trtexec --onnx=sample.onnx --fp16 --useDLACore=0 --allowGPUFallback --exportProfile=sample.dla.profile.json --exportLayerInfo=sample.dla.layerinfo.json --exportOutput=sample.dla.output.json --dumpLayerInfo --dumpProfile --profilingVerbosity=detailed --separateProfileRun --useSpinWait --useCudaGraph --saveEngine=sample.dla.engine --verbose

My simple question:
Why can a conv with input shape [1, 1706, 3, 3] run on the DLA while [1, 1707, 3, 3] cannot?
And which restriction have I hit?

May I know which HW platform you used?
And what kind of OS?

Thanks

Thanks for the reply. The following information was collected by jetson_stats.

  • NVIDIA Jetson UNKNOWN
    • Jetpack UNKNOWN [L4T 35.1.0]
    • NV Power Mode: MAXN - Type: 0
    • jetson_stats.service: active
  • Libraries:
    • CUDA: NOT_INSTALLED
    • cuDNN: 8.4.1.50
    • TensorRT: 8.4.1.5
    • Visionworks: NOT_INSTALLED
    • OpenCV: 4.5.4 compiled CUDA: NO
    • VPI: ii libnvvpi2 2.1.6 arm64 NVIDIA Vision Programming Interface library
    • Vulkan: 1.3.203
declare -x JETSON_BOARD="P3737-000"
declare -x JETSON_BOARDIDS=""
declare -x JETSON_CHIP_ID=""
declare -x JETSON_CODENAME="concord"
declare -x JETSON_CUDA="NOT_INSTALLED"
declare -x JETSON_CUDA_ARCH_BIN="NONE"
declare -x JETSON_CUDNN="8.4.1.50"
declare -x JETSON_JETPACK="UNKNOWN"
declare -x JETSON_L4T="35.1.0"
declare -x JETSON_L4T_RELEASE="35"
declare -x JETSON_L4T_REVISION="1.0"
declare -x JETSON_MACHINE="NVIDIA Jetson UNKNOWN"
declare -x JETSON_MODULE="UNKNOWN"
declare -x JETSON_OPENCV="4.5.4"
declare -x JETSON_OPENCV_CUDA="NO"
declare -x JETSON_SOC="tegra23x"
declare -x JETSON_TENSORRT="8.4.1.5"
declare -x JETSON_TYPE="UNKNOWN"
declare -x JETSON_VISIONWORKS="NOT_INSTALLED"
declare -x JETSON_VPI="ii libnvvpi2 2.1.6 arm64 NVIDIA Vision Programming Interface library"
declare -x JETSON_VULKAN_INFO="1.3.203"

Your topic was posted in the wrong category. I am moving this to the Jetson AGX Orin category for visibility.

Hi,

It’s possible that when the channel count increases to 1707, the required resources exceed the DLA’s capability.
Could you share the source for the ONNX model generation so we can check with our dev team for further information?

Thanks.

import torch
from torch import nn
from torch.nn import functional as F


C = 1707

x = torch.rand(1, C, 3, 3).cuda().half()
w = torch.rand(3, C, 3, 3).cuda().half()
b = torch.rand(3, ).cuda().half()


class Mod(nn.Module):
    def forward(self, x):
        return F.conv2d(x, w, b)


model = Mod()
torch.onnx.export(model, (x,), "sample.onnx")

Thanks.

I also checked whether these numbers meet the CBUF size requirement with the following code. The result was PASSED.

import numpy as np

# CBUF check following the formula in the public document
# (per-element multiplier: 1 byte for INT8, 2 bytes otherwise).
INT8 = False
inputDims_c = 1707
inputDims_w = 3
kernelSize_h = 3
kernelSize_w = 3
dilation_h = 1

bytesPerElem = 1 if INT8 else 2

entriesPerDataSlice = np.ceil(np.ceil(inputDims_c * bytesPerElem / 32.0) * inputDims_w /
                              4.0).astype(np.uint32)
dilatedKernelHt = (kernelSize_h - 1) * dilation_h + 1

# roundUp(x, 128): round x up to the nearest multiple of 128.
# (Note: np.round(x, 128) rounds to 128 *decimal places*, which is a no-op
# for integers and does not implement the document's roundUp.)
kernelBytes = inputDims_c * kernelSize_h * kernelSize_w * bytesPerElem
kernelBytesRounded = np.ceil(kernelBytes / 128.0) * 128

wtBanksForOneKernel = np.ceil(kernelBytesRounded / 32768.0).astype(np.uint32)
minDataBanks = np.ceil(float(entriesPerDataSlice * dilatedKernelHt) / 256.0).astype(np.uint32)

print((wtBanksForOneKernel + minDataBanks) <= 16)

Hi,

Do you mean the 1707 failure case is still within the CBUF constraint?
Thanks.

Yes. Actually, the 1707 failure case meets all the constraints from the open document.

Any updates here?

Hi,

Thanks for your patience.

We have confirmed that the same behavior is reproducible in our environment as well,
and we are now checking with our internal team.

Will share more information with you later.

Thanks.

Hi,

The layer cannot be deployed on the DLA because it exceeds the CBUF limit.

For c=1707:

Number of weight banks
        = ceil(roundUp(numChannels * kernelHeight * kernelWidth * 32, 128) / (cbufEntryWidth * cbufEntriesPerBank))
        = ceil(roundUp(1707 * 3 * 3 * 32, 128) / (128 * 256))
        = ceil(491648 / 32768)
        = 16

For c=1706:

Number of weight banks
        = ceil(roundUp(numChannels * kernelHeight * kernelWidth * 32, 128) / (cbufEntryWidth * cbufEntriesPerBank))
        = ceil(roundUp(1706 * 3 * 3 * 32, 128) / (128 * 256))
        = ceil(491392 / 32768)
        = 15

minDataBanks = 1, so:
[c=1707]: fails, since wtBanksForOneKernel + minDataBanks > 16
[c=1706]: passes, since wtBanksForOneKernel + minDataBanks <= 16
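For reference, the corrected bank calculation above can be sketched in Python. The constant names and the value MIN_DATA_BANKS = 1 are taken from this reply; the outer ceiling matches the bank-count check used earlier in this thread:

```python
import math

CBUF_ENTRY_WIDTH = 128       # bytes per CBUF entry
CBUF_ENTRIES_PER_BANK = 256  # entries per bank

def round_up(x, multiple):
    """Round x up to the nearest multiple of `multiple`."""
    return (x + multiple - 1) // multiple * multiple

def weight_banks(num_channels, kernel_h=3, kernel_w=3):
    # Corrected formula from this reply: the per-element multiplier is 32,
    # not the (INT8 ? 1 : 2) given in the public document.
    raw = num_channels * kernel_h * kernel_w * 32
    return math.ceil(round_up(raw, 128) / (CBUF_ENTRY_WIDTH * CBUF_ENTRIES_PER_BANK))

MIN_DATA_BANKS = 1  # from this reply

for c in (1706, 1707):
    wt = weight_banks(c)
    fits = wt + MIN_DATA_BANKS <= 16
    print(f"c={c}: weight banks={wt}, fits in CBUF: {fits}")
# c=1706: weight banks=15, fits in CBUF: True
# c=1707: weight banks=16, fits in CBUF: False
```

This reproduces the boundary seen in the logs: 1706 channels is the last configuration that leaves room for the one required data bank.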

The equation listed in our current document is not clear enough.
Our internal team will improve it soon.

Thanks.

Hi, thanks for your reply.

Number of weight banks
        = roundUp(numChannels * kernelHeight * kernelWidth * 32, 128) / (cbufEntryWidth * cbufEntriesPerBank)
        = roundUp(1707 * 3 * 3 * 32, 128) / (128 * 256)
        = 16


This formula differs from the one in the document.
Why do you multiply by 32?
In the document, the multiplier is (INT8 ? 1 : 2).

Hi,

Sorry, the formula currently in the document is incorrect.
It will be updated soon.

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.