Observing different output when running layer on DLA vs running on GPU

VivekKrishnan · June 8, 2021, 8:28pm

Hey there, I am running into some errors when trying to run certain layers on the DLA. Here is the setup I am currently using for running models.

Platform: Jetson AGX Xavier
Jetpack Version: 4.4.1
TensorRT Version: 7.1.3

My current workflow is to build a model in PyTorch, convert it to ONNX, and then load it in trtexec using the --onnx parameter.
PyTorch: 1.4.5
ONNX: 1.6.0

Here is the code I am using to export the onnx models:

   torch.onnx.export(
       pytorch_model,
       input_tensor,
       "model.onnx",
       output_names=["output"],
       opset_version=7,
       do_constant_folding=True,
       export_params=True)

For each model, I pass an input tensor with shape (1, n, 32, 32).

To run the model on my Xavier, I am using the command:

trtexec --fp16 --workspace=64 --iterations=1 --warmUp=0 --duration=0 --onnx=model.onnx --exportOutput=output.json

When running on the DLA, I use the command:

trtexec --fp16 --workspace=64 --iterations=1 --warmUp=0 --duration=0 --useDLACore=0 --onnx=model.onnx --exportOutput=output.json
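The two invocations differ only in the --useDLACore flag, so they can be generated from one helper to keep the GPU and DLA runs otherwise identical. This is a minimal Python sketch, assuming trtexec is on the PATH; the helper name build_trtexec_cmd and the per-device output file names are hypothetical and not part of my actual setup.

```python
import shlex


def build_trtexec_cmd(onnx_path, output_json, dla_core=None):
    """Build the trtexec argument list used in this post.

    dla_core=None targets the GPU; an integer adds --useDLACore=<core>.
    """
    cmd = [
        "trtexec",
        "--fp16",
        "--workspace=64",
        "--iterations=1",
        "--warmUp=0",
        "--duration=0",
    ]
    if dla_core is not None:
        cmd.append(f"--useDLACore={dla_core}")
    cmd += [f"--onnx={onnx_path}", f"--exportOutput={output_json}"]
    return cmd


# Hypothetical file names, chosen so the two runs do not overwrite each other.
gpu_cmd = build_trtexec_cmd("model.onnx", "output_gpu.json")
dla_cmd = build_trtexec_cmd("model.onnx", "output_dla.json", dla_core=0)

print(shlex.join(gpu_cmd))
print(shlex.join(dla_cmd))
```

On the device, each list could be passed to subprocess.run(cmd, check=True) to perform the actual run.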

I have been validating whether the DLA produces the same output as running the layer on the GPU and I have been getting some interesting results. Specifically, I have been trying to implement convolutions as follows (for different values of n):

from torch import nn
conv = nn.Conv2d(n, n, kernel_size=3, stride=1, padding=1, groups=n, bias=False)

I have run this validation with n in [4, 64] and had no issues (the GPU output equals the DLA output). However, beyond n=64, I run into some interesting results.

When I set n to 68, the exported output file for the DLA contains NaN values, and the GPU and DLA outputs differ.
model68.onnx (2.7 KB)

When I set n to 96, there are no NaN values, but the GPU and DLA outputs still differ.
model96.onnx (3.7 KB)

When I set n to 128, 256, or 512, the GPU and DLA tensors match exactly. It seems that powers of 2 work fine. Is this expected behavior for the DLA?
model256.onnx (9.3 KB)
model512.onnx (18.3 KB)
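For reference, the GPU/DLA comparison described above can be sketched roughly as follows. This is a minimal illustration, not my exact script: the function name compare_outputs and the tolerance atol are assumptions, and the sample values are made up purely to show the NaN and mismatch cases. Loading the values from the files written by --exportOutput is left out because the JSON layout is not shown in this post.

```python
import math


def compare_outputs(gpu_values, dla_values, atol=1e-2):
    """Compare two flat lists of floats taken from a GPU run and a DLA run.

    Returns the number of NaNs in the DLA output, the number of elements that
    differ by more than atol (a NaN on either side counts as a mismatch), and
    the largest absolute difference among the non-NaN pairs.
    """
    if len(gpu_values) != len(dla_values):
        raise ValueError("outputs have different lengths")

    dla_nan_count = sum(1 for v in dla_values if math.isnan(v))
    mismatches = 0
    max_abs_diff = 0.0
    for g, d in zip(gpu_values, dla_values):
        if math.isnan(g) or math.isnan(d):
            mismatches += 1
            continue
        diff = abs(g - d)
        max_abs_diff = max(max_abs_diff, diff)
        if diff > atol:
            mismatches += 1

    return {
        "dla_nan_count": dla_nan_count,
        "mismatches": mismatches,
        "max_abs_diff": max_abs_diff,
    }


# Illustrative values only; these are not taken from the attached models.
report = compare_outputs([1.0, 2.0, 3.0], [1.0, float("nan"), 3.5])
print(report)
```

Setting atol=0.0 reproduces an exact-match check like the one I used for the power-of-2 cases.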

With this setup, when I set n to 1024, the layer fails to load onto the DLA. I assume that when using fp16 mode, the weight tensor size (1024, 1, 3, 3) or input size (1, 1024, 32, 32) exceeds the maximum size for the convolution buffer.
model1024.onnx (36.3 KB)
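To put numbers on those two tensors, here is a small Python sketch that computes their raw FP16 sizes for the values of n tried above. The function name depthwise_conv_sizes is hypothetical, and the figures are plain element counts times 2 bytes; they do not account for any padding or internal layout the DLA may apply, so they are only a rough guide.

```python
BYTES_PER_FP16 = 2


def depthwise_conv_sizes(n, kernel=3, height=32, width=32):
    """Return (weight_bytes, input_bytes) in FP16 for the depthwise conv above.

    Weights have shape (n, 1, kernel, kernel) because groups == n;
    the input has shape (1, n, height, width).
    """
    weight_elems = n * 1 * kernel * kernel
    input_elems = 1 * n * height * width
    return weight_elems * BYTES_PER_FP16, input_elems * BYTES_PER_FP16


for n in (64, 68, 96, 128, 256, 512, 1024):
    weight_bytes, input_bytes = depthwise_conv_sizes(n)
    print(
        f"n={n:5d}  weights={weight_bytes / 1024:7.2f} KiB  "
        f"input={input_bytes / 1024:8.1f} KiB"
    )
```

For n=1024 this gives 18 KiB of weights and 2048 KiB (2 MiB) of input data.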

The NVDLA hardware architecture page seems to state that the Convolution Buffer has the following constraint:

Range of values: 4KB~32KB

This seems to differ from the DLA support page, which defines the following constraints:

The maximum size of weights supported by DLA is 512 MB.
A DLA network can only support up to 1 GB of intermediate tensor data. Tensors that are the input and output to the DLA graph are not counted against this limit. TensorRT will reject networks that exceed this limit that are built without GPU fallback enabled.
Number of output maps must be in the range [1, 8192].
Number of groups must be in the range [1, 8192] for operations using the formats TensorFormat::kLINEAR, TensorFormat::kCHW16, and TensorFormat::kCHW32.

TLDR:
When building PyTorch models for the DLA, do convolutions only support a groups parameter that is a power of 2?
How do I determine how big a layer can be before it's not supported on the DLA (in fp16 mode)?
Is this issue resolved in the next JetPack release?

AastaLLL · June 9, 2021, 4:12am

Hi,

In general, you can check the dmesg log to see if DLA works correctly.
With your models, we can see DLA failures for the n=68 and n=96 cases:

[90032.012347] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x1ffebe9000, fsynr=0x230013, cb=20, sid=81(0x51 - NVDLA0), pgd=85b367003, pud=85b367003, pmd=7fd4af003, pte=0
[90032.012664] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu1, iova=0x1ffebe8000, fsynr=0x13, cb=20, sid=81(0x51 - NVDLA0), pgd=85b367003, pud=85b367003, pmd=7fd4af003, pte=0
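If you want to check for the same failure after each run, the log can be scanned for these entries. The snippet below is only a rough Python sketch: the function name find_dla_smmu_faults is hypothetical, and the pattern is written against the two lines above, so it may need adjusting for other kernel versions.

```python
import re

# Matches SMMU context faults whose stream ID is attributed to an NVDLA engine,
# following the format of the dmesg lines quoted in this reply.
FAULT_RE = re.compile(
    r"Unhandled context fault: (?P<smmu>smmu\d+), "
    r"iova=(?P<iova>0x[0-9a-fA-F]+).*?"
    r"sid=\d+\(0x[0-9a-fA-F]+ - (?P<client>NVDLA\d+)\)"
)


def find_dla_smmu_faults(dmesg_text):
    """Return a list of (smmu, iova, client) tuples for NVDLA SMMU faults."""
    faults = []
    for line in dmesg_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            faults.append((match["smmu"], match["iova"], match["client"]))
    return faults


# First log line from this reply, used as sample input.
sample = (
    "[90032.012347] t19x-arm-smmu 12000000.iommu: Unhandled context fault: "
    "smmu0, iova=0x1ffebe9000, fsynr=0x230013, cb=20, "
    "sid=81(0x51 - NVDLA0), pgd=85b367003, pud=85b367003, pmd=7fd4af003, pte=0"
)
print(find_dla_smmu_faults(sample))
```

An empty list after a DLA run means no fault of this particular form was logged; it does not by itself prove the output is correct.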

We are checking the detailed constraint with our internal team.
Will share more information later.

Thanks.

AastaLLL · June 9, 2021, 6:39am

Hi,

We confirmed that all of your models work correctly with our next release.
Please wait for the announcement.

Thanks.

VivekKrishnan · June 15, 2021, 6:53am

Hi there, to confirm, which release are you referring to? JetPack 4.6? The next TensorRT release? We are evaluating whether we need to update our models given the release timelines.

kayccc · June 23, 2021, 3:04am

Yes, it will be fixed in the next release, JetPack 4.6, which will be available in late July 2021.

