Hey there, I am running into some errors when trying to run certain layers on the DLA. Here is the setup I currently use to run models:
Platform: Jetson AGX Xavier
JetPack Version: 4.4.1
TensorRT Version: 7.1.3
My workflow is to build a model in PyTorch, export it to ONNX, and then load it into trtexec with the --onnx parameter.
PyTorch: 1.4.5
ONNX: 1.6.0
Here is the code I use to export the ONNX models:
```python
import torch

torch.onnx.export(
    pytorch_model,
    input_tensor,
    "model.onnx",
    output_names=["output"],
    opset_version=7,
    do_constant_folding=True,
    export_params=True,
)
```
For each model, I pass an input tensor of shape (1, n, 32, 32).
To run the model on my Xavier's GPU, I use the command:
```shell
trtexec --fp16 --workspace=64 --iterations=1 --warmUp=0 --duration=0 --onnx=model.onnx --exportOutput=output.json
```
When running on the DLA, I use the command:
```shell
trtexec --fp16 --workspace=64 --iterations=1 --warmUp=0 --duration=0 --useDLACore=0 --onnx=model.onnx --exportOutput=output.json
```
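Both commands write to output.json, so the second run overwrites the first. A minimal sketch for generating the GPU and DLA command lines per model, with a separate output file for each run, follows. It reuses only the flags shown above; the per-n file names (model{n}.onnx, output{n}_gpu.json, output{n}_dla.json) and the helper names are hypothetical.

```python
import shlex

# Flags copied from the trtexec commands above; only --useDLACore and the
# output file name differ between the GPU and DLA runs.
COMMON_FLAGS = [
    "--fp16",
    "--workspace=64",
    "--iterations=1",
    "--warmUp=0",
    "--duration=0",
]


def trtexec_command(onnx_path, output_json, dla_core=None):
    """Build a trtexec argument list; dla_core=None means a GPU-only run."""
    cmd = ["trtexec", *COMMON_FLAGS]
    if dla_core is not None:
        cmd.append(f"--useDLACore={dla_core}")
    cmd += [f"--onnx={onnx_path}", f"--exportOutput={output_json}"]
    return cmd


def commands_for(n):
    """GPU and DLA commands for the hypothetical file model{n}.onnx."""
    onnx_path = f"model{n}.onnx"
    gpu = trtexec_command(onnx_path, f"output{n}_gpu.json")
    dla = trtexec_command(onnx_path, f"output{n}_dla.json", dla_core=0)
    return gpu, dla


if __name__ == "__main__":
    for n in (68, 96, 128):
        for cmd in commands_for(n):
            # Shell form for copy/paste; subprocess.run(cmd, check=True) would execute it.
            print(shlex.join(cmd))
```

Building an argument list instead of a single string keeps it usable with subprocess.run without shell quoting issues.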
I have been checking whether the DLA produces the same output as the GPU for a single layer, and the results are surprising. Specifically, I have been testing depthwise convolutions defined as follows (for different values of n):
```python
from torch import nn

conv = nn.Conv2d(n, n, kernel_size=3, stride=1, padding=1, groups=n, bias=False)
```
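With groups=n and n input and output channels, each output channel is computed only from its own input channel and its own 3x3 kernel. The following plain-Python sketch spells out those semantics and could serve as a third reference when the GPU and DLA disagree. The function name is hypothetical, and the sketch drops the batch dimension and ignores FP16 rounding.

```python
def depthwise_conv3x3(x, w):
    """Reference for nn.Conv2d(n, n, 3, stride=1, padding=1, groups=n, bias=False).

    x: input as nested lists [n][H][W] (batch dimension dropped)
    w: weights as nested lists [n][3][3] (PyTorch stores them as (n, 1, 3, 3))
    Returns nested lists [n][H][W].
    """
    n, h, wd = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * wd for _ in range(h)] for _ in range(n)]
    for c in range(n):  # groups=n: output channel c only sees input channel c
        for i in range(h):
            for j in range(wd):
                acc = 0.0
                for ki in range(3):
                    for kj in range(3):
                        # padding=1: taps that fall outside the image read zero
                        ii, jj = i + ki - 1, j + kj - 1
                        if 0 <= ii < h and 0 <= jj < wd:
                            acc += x[c][ii][jj] * w[c][ki][kj]
                out[c][i][j] = acc
    return out
```

Like PyTorch, this computes a cross-correlation (the kernel is not flipped), so a kernel with a single 1 in the center returns the input unchanged.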
I have run this validation for n in [4, 64] with no issues (the GPU output equals the DLA output). Beyond n = 64, however, the results become inconsistent.
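The GPU-vs-DLA check described above can be sketched as follows: count NaNs on each side and measure the largest absolute difference. The loader assumes the trtexec --exportOutput file is a JSON list of objects with "name" and "values" keys; that format is an assumption to verify against your own output.json. The helper names and the default tolerance are hypothetical.

```python
import json
import math


def load_values(path, name="output"):
    """Return the flat value list for tensor `name` from an exported output file.

    ASSUMPTION: the file is a JSON list of objects with "name" and "values" keys.
    """
    with open(path) as f:
        tensors = json.load(f)
    for t in tensors:
        if t["name"] == name:
            return [float(v) for v in t["values"]]
    raise KeyError(name)


def compare(gpu, dla, atol=0.0):
    """Summarize how far the DLA values are from the GPU values.

    atol=0.0 means "match exactly"; a small tolerance may suit FP16 better.
    """
    if len(gpu) != len(dla):
        raise ValueError(f"length mismatch: {len(gpu)} vs {len(dla)}")
    nan_gpu = sum(math.isnan(v) for v in gpu)
    nan_dla = sum(math.isnan(v) for v in dla)
    # Differences are only meaningful where neither side is NaN.
    diffs = [abs(a - b) for a, b in zip(gpu, dla)
             if not (math.isnan(a) or math.isnan(b))]
    return {
        "nan_gpu": nan_gpu,
        "nan_dla": nan_dla,
        "max_abs_diff": max(diffs, default=0.0),
        "mismatches": sum(d > atol for d in diffs),
    }
```

Usage, with the hypothetical per-run file names: compare(load_values("output68_gpu.json"), load_values("output68_dla.json")).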
When I set n to 68, the exported DLA output file contains NaN values, and the GPU and DLA outputs differ.
model68.onnx (2.7 KB)
When I set n to 96, there are no NaN values, but the GPU and DLA outputs still differ.
model96.onnx (3.7 KB)
When I set n to 128, 256, or 512, the GPU and DLA tensors match exactly, so powers of 2 seem to work fine. Is this expected behavior for the DLA?
model256.onnx (9.3 KB)
model512.onnx (18.3 KB)
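The power-of-two observation can be checked against the reported data points with a one-bit test. This sketch lists only the values of n named explicitly above; the [4, 64] range is left out because the exact values tested there are not listed, so this restates the observations rather than proving a rule.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ...: a power of two has exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


# Outcomes exactly as reported in this post (GPU output vs DLA output).
REPORTED = {68: "mismatch + NaN", 96: "mismatch", 128: "match",
            256: "match", 512: "match"}

for n, outcome in sorted(REPORTED.items()):
    print(f"n={n:4d}  power_of_two={is_power_of_two(n)!s:5}  {outcome}")
```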
With this setup, when I set n to 1024, the layer fails to load onto the DLA. I assume that in FP16 mode either the weight tensor (1024, 1, 3, 3) or the input (1, 1024, 32, 32) exceeds the maximum size of the convolution buffer.
model1024.onnx (36.3 KB)
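The tensor sizes in question can be worked out directly. This sketch computes the FP16 byte sizes of the weights and the input for the tested values of n and compares them with the 32 KB upper end of the convolution buffer range quoted below; the helper name is hypothetical.

```python
FP16_BYTES = 2  # --fp16: each element is a 16-bit float


def fp16_sizes(n, h=32, w=32, k=3):
    """(weight bytes, input bytes) for nn.Conv2d(n, n, k, groups=n) on a (1, n, h, w) input."""
    weight_elems = n * 1 * k * k      # weight shape (n, 1, k, k)
    input_elems = 1 * n * h * w       # input shape (1, n, h, w)
    return weight_elems * FP16_BYTES, input_elems * FP16_BYTES


CBUF_MAX = 32 * 1024  # upper end of the 4KB~32KB range from the NVDLA hwarch page

for n in (64, 68, 96, 128, 256, 512, 1024):
    wb, ib = fp16_sizes(n)
    print(f"n={n:5d}  weights={wb:6d} B  input={ib:8d} B  "
          f"weights<=32KB={wb <= CBUF_MAX}  input<=32KB={ib <= CBUF_MAX}")
```

For n = 1024 the weights are 18,432 B and the input is 2,097,152 B. The input already exceeds 32 KB at n = 64 (131,072 B), which runs correctly, suggesting that the quoted range does not bound the whole input tensor.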
The NVDLA hardware architecture (hwarch) page seems to state that the Convolution Buffer has the following constraint:
- Range of values: 4KB~32KB
This seems to differ from the DLA support page, which lists the following constraints:
- The maximum size of weights supported by DLA is 512 MB.
- A DLA network can only support up to 1 GB of intermediate tensor data. Tensors that are the input and output to the DLA graph are not counted against this limit. TensorRT will reject networks that exceed this limit that are built without GPU fallback enabled.
- Number of output maps must be in the range [1, 8192].
- Number of groups must be in the range [1, 8192] for operations using the formats TensorFormat::kLINEAR, TensorFormat::kCHW16, and TensorFormat::kCHW32.
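The limits listed on the DLA support page can be encoded as a quick check. This is a sketch with hypothetical names, treating 1 GB as 1024 MB. For a single-layer network the input and output are the graph's own tensors, so the intermediate size is 0 here.

```python
MB = 1024 ** 2

# Limits as quoted from the DLA support page above.
MAX_WEIGHT_BYTES = 512 * MB
MAX_INTERMEDIATE_BYTES = 1024 * MB  # graph inputs/outputs do not count
MAX_OUTPUT_MAPS = 8192
MAX_GROUPS = 8192


def dla_limit_violations(n, weight_bytes, intermediate_bytes=0):
    """List which quoted limits a conv with n output maps and groups=n would break."""
    problems = []
    if weight_bytes > MAX_WEIGHT_BYTES:
        problems.append("weights > 512 MB")
    if intermediate_bytes > MAX_INTERMEDIATE_BYTES:
        problems.append("intermediate tensors > 1 GB")
    if not 1 <= n <= MAX_OUTPUT_MAPS:
        problems.append("output maps outside [1, 8192]")
    if not 1 <= n <= MAX_GROUPS:
        problems.append("groups outside [1, 8192]")
    return problems


# n = 1024 in FP16: weights are 1024 * 3 * 3 * 2 bytes.
print(dla_limit_violations(1024, weight_bytes=1024 * 3 * 3 * 2))  # []
```

By these limits alone, n = 1024 should be accepted, so the rejection appears to come from a constraint that is not in this list.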
TL;DR:
When building PyTorch models, do grouped convolutions on the DLA only support group counts that are powers of 2?
How do I determine how large a layer can be before it is no longer supported on the DLA (in FP16 mode)?
Is this issue resolved in the next JetPack release?