DLA bindOutputTensor fails during inference. TensorRT 7.1.3

After converting our model from ONNX using TensorRT 7.1.3 on Jetson AGX, we get errors like the following during inference:

NVMEDIA_DLA :  495, ERROR: bindOutputTensor failed. err: 0xB
NVMEDIA_DLA : 1920, ERROR: BindOutputTensorArgs failed (Output). status: 0x7.
../rtExt/dla/native/dlaUtils.cpp (194) - DLA Error in submit: 7 (Failure to submit program to DLA engine.)
FAILED_EXECUTION: std::exception
NVMEDIA_DLA :  885, ERROR: runtime registerEvent failed. err: 0x4.
NVMEDIA_DLA : 1849, ERROR: RequestSubmitEvents failed. status: 0x7.
../rtExt/dla/native/dlaUtils.cpp (194) - DLA Error in submit: 7 (Failure to submit program to DLA engine.)
FAILED_EXECUTION: std::exception
NVMEDIA_DLA :  885, ERROR: runtime registerEvent failed. err: 0x4.
NVMEDIA_DLA : 1849, ERROR: RequestSubmitEvents failed. status: 0x7.
../rtExt/dla/native/dlaUtils.cpp (194) - DLA Error in submit: 7 (Failure to submit program to DLA engine.)
FAILED_EXECUTION: std::exception
...

The relevant part of our model looks like this:

import torch
import torch.nn as nn
# define network
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv2d(256, 2, 3, 1, 1)
    def forward(self, x):
        x = torch.relu(x)
        x1 = x
        x2 = self.conv(x)
        x = torch.cat([x1, x2], 1)
        return x
net = Net()
a = torch.randn(1,256,40,40)
torch.onnx.export(net, a, "concat.onnx", verbose=True, opset_version=11)

The generated ONNX model is here:
concat.onnx (18.4 KB)

We may have missed something: although we checked the links below, we don't know which layer or resource causes this error. After removing the ReLU, Conv, or Concat op, the error disappears. The function "canRunOnDLA" also returns true for every layer.

https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#dla_topic

https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#dla_layers

Hi,

This is a known issue (see: I get mc-err on Jetson Xavier NX).
The DLA doesn't support the concat operation when the input channel counts are not multiples of 16 (for FP16) or 32 (for INT8).
This can cause unexpected errors.
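The alignment rule above can be expressed as a small check. This is a minimal sketch: the multiples of 16 (FP16) and 32 (INT8) are taken from this thread, not from official DLA documentation, and the helper name is our own.

```python
def dla_concat_channels_ok(channel_counts, precision="fp16"):
    """Return True if every concat input's channel count meets the
    DLA alignment rule quoted above (16 for FP16, 32 for INT8)."""
    align = 16 if precision == "fp16" else 32
    return all(c % align == 0 for c in channel_counts)

# The failing model concatenates a 256-channel tensor with a 2-channel
# conv output: 2 is not a multiple of 16, so the concat is rejected in FP16.
print(dla_concat_channels_ok([256, 2]))   # False
print(dla_concat_channels_ok([256, 16]))  # True
```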

We have already added this support to our internal DLA branch and confirmed that your model works well with it.
The package will be available in our next JetPack release.

Thanks.

Hi,

Thanks for your reply.
I also tested concat with axis=1, where the input shapes are [1, 32, 40, 40] and [1, 32, 40, 40], and I got the same error.
Does shape[1] mean the channel dimension?

My test code is below:

import torch
import torch.nn as nn
# define network
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv2d(256, 32, 3, 1, 1)
    def forward(self, x):
        x = torch.relu(x)
        x1 = self.conv(x)
        x2 = self.conv(x)
        print(x1.shape)
        print(x2.shape)
        x = torch.cat([x1, x2], 1)
        print(x.shape)
        return x
net = Net()
a = torch.randn(1,256,40,40)
torch.onnx.export(net, a, "concat.onnx", verbose=True, opset_version=11)

Thanks.

Hi,

The failure is caused by the channel size being too large.
Reducing it to 128 can be a temporary workaround (WAR) for running inference on the DLA.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv2d(128, 32, 3, 1, 1)
    ...
a = torch.randn(1,128,40,40)

This is also a limitation and is improved in our next DLA release.
We tested your model ((1, 256, 40, 40)) with our next JetPack release, and it runs correctly without error.
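To the earlier question: yes, in PyTorch's default NCHW layout, shape[1] is the channel dimension. A quick pure-Python illustration:

```python
# PyTorch tensors exported to ONNX here use NCHW layout:
# (batch N, channels C, height H, width W).
shape = (1, 32, 40, 40)  # one of the concat inputs from the test above
N, C, H, W = shape
print(f"batch={N}, channels={C}, height={H}, width={W}")
# channels=32 is a multiple of 16, so this concat meets the FP16
# alignment rule; the 256-channel conv input is what hits the
# size limitation described above.
```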

Thanks.
