Trt_pose on DLA


I’m trying to run trt_pose on one of the DLA cores of the Xavier NX. However, building the engine fails with:

[TensorRT] ERROR: ../builder/cudnnBuilder2.cpp (1757) - Assertion Error in operator(): 0 (et.region->getType() == RegionType::kNVM)

This is my code for building the engine (adapted from torch2trt to use DLA), specifically with the densenet121_baseline_att model:

import torch
import trt_pose.models
from torch2trt import ConversionContext
import tensorrt as trt

def default_input_names(num_inputs):
    return ["input_%d" % i for i in range(num_inputs)]

def default_output_names(num_outputs):
    return ["output_%d" % i for i in range(num_outputs)]

human_pose = {"supercategory": "person", "id": 1, "name": "person", "keypoints": ["nose", "left_eye", "right_eye", "left_ear", "right_ear", "left_shoulder", "right_shoulder", "left_elbow", "right_elbow", "left_wrist", "right_wrist", "left_hip", "right_hip", "left_knee", "right_knee", "left_ankle", "right_ankle", "neck"], "skeleton": [[16, 14], [14, 12], [17, 15], [15, 13], [12, 13], [6, 8], [7, 9], [8, 10], [9, 11], [2, 3], [1, 2], [1, 3], [2, 4], [3, 5], [4, 6], [5, 7], [18, 1], [18, 6], [18, 7], [18, 12], [18, 13]]}

if __name__=="__main__":

    inputs = (torch.zeros((1, 3, 256, 256)).cuda(),)

    num_parts = len(human_pose['keypoints'])
    num_links = len(human_pose['skeleton'])
    module = trt_pose.models.densenet121_baseline_att(num_parts, 2 * num_links).cuda().eval()

    logger = trt.Logger(trt.Logger.VERBOSE)
    builder = trt.Builder(logger)

    # run once to get num outputs
    outputs = module(*inputs)
    if not isinstance(outputs, tuple) and not isinstance(outputs, list):
        outputs = (outputs,)

    input_names = default_input_names(len(inputs))
    output_names = default_output_names(len(outputs))

    network = builder.create_network()
    with ConversionContext(network) as ctx:

        ctx.add_inputs(inputs, input_names)

        outputs = module(*inputs)

        if not isinstance(outputs, tuple) and not isinstance(outputs, list):
            outputs = (outputs,)
        ctx.mark_outputs(outputs, output_names)

    builder.max_batch_size = 1
    config = builder.create_builder_config()

    config.max_workspace_size = 1 << 30



    config.default_device_type = trt.DeviceType.DLA
    config.DLA_core = 0

    # profile = builder.create_optimization_profile()
    # profile.set_shape(
    #     'input_0',                          # input tensor name
    #     (1, 3, 256, 256),  # min shape
    #     (1, 3, 256, 256),  # opt shape
    #     (1, 3, 256, 256))  # max shape
    # config.add_optimization_profile(profile)

    engine = builder.build_engine(network, config)

Edit to add: I’m on JetPack 4.5.
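
A note on the configuration above: DLA executes layers only in FP16 or INT8, and layers it cannot run need GPU fallback to be enabled. Below is a minimal sketch of a config that sets both, assuming the TensorRT 7.x Python API (`set_flag`, `BuilderFlag.FP16`, `BuilderFlag.GPU_FALLBACK`). The helper name `configure_for_dla` is hypothetical. The thread later reports the same assertion from trtexec with GPU fallback enabled, so this is not presented as a fix for the error.

```python
def configure_for_dla(trt, config, dla_core=0):
    """Apply DLA-related settings to a TensorRT builder config.

    `trt` is the imported tensorrt module and `config` is the object returned
    by builder.create_builder_config(). Passing the module in is a choice of
    this sketch, so the helper can be defined without TensorRT installed.
    """
    # Route layers to the DLA by default and choose which core to use.
    config.default_device_type = trt.DeviceType.DLA
    config.DLA_core = dla_core
    # DLA runs only FP16 or INT8, so allow FP16 kernels.
    config.set_flag(trt.BuilderFlag.FP16)
    # Let layers that DLA cannot run fall back to the GPU.
    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
    return config


# Usage on a Jetson with TensorRT installed (not run here):
#   import tensorrt as trt
#   config = configure_for_dla(trt, builder.create_builder_config(), dla_core=0)
```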


Would you mind converting the model to ONNX and then building a TensorRT engine from it with trtexec?

/usr/src/tensorrt/bin/trtexec --onnx=[model] --useDLACore=0 --allowGPUFallback --verbose
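
If it helps to script this step, a small Python wrapper can assemble the same command. This is only a sketch: the function name `trtexec_dla_command` and the file name `model.onnx` are placeholders, and `--fp16` and `--saveEngine` are additional trtexec options that are not part of the command above.

```python
import shlex

TRTEXEC = "/usr/src/tensorrt/bin/trtexec"  # path used in this thread


def trtexec_dla_command(onnx_path, dla_core=0, fp16=False, engine_path=None):
    """Return the trtexec argument list for a DLA build with GPU fallback."""
    cmd = [
        TRTEXEC,
        "--onnx=%s" % onnx_path,
        "--useDLACore=%d" % dla_core,
        "--allowGPUFallback",
        "--verbose",
    ]
    if fp16:
        # Optional: request FP16 precision explicitly.
        cmd.append("--fp16")
    if engine_path is not None:
        # Optional: write the serialized engine to disk.
        cmd.append("--saveEngine=%s" % engine_path)
    return cmd


cmd = trtexec_dla_command("model.onnx")  # placeholder file name
print(" ".join(shlex.quote(arg) for arg in cmd))

# On the Jetson, the command could then be run with:
#   import subprocess
#   subprocess.run(cmd, check=True)
```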


Sure, I’ll try that!

Converting to ONNX using the script works, but building the engine with trtexec fails in the same way as my script. These are the last few lines of the log (I can include the whole log if you need it):

[02/19/2021-09:47:25] [V] [TRT] Total Activation Memory: 33488896
[02/19/2021-09:47:25] [I] [TRT] Detected 1 inputs and 3 output network tensors.
[02/19/2021-09:47:25] [V] [TRT] Conv_4 + Relu_6 () Set Tactic Name: volta_first_layer_filter7x7_fwd
[02/19/2021-09:47:25] [V] [TRT] Builder timing cache: created 1971 entries, 4528 hit(s)
[02/19/2021-09:47:25] [E] [TRT] ../builder/cudnnBuilder2.cpp (1757) - Assertion Error in operator(): 0 (et.region->getType() == RegionType::kNVM)
[02/19/2021-09:47:25] [E] Engine creation failed
[02/19/2021-09:47:25] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=./densenet121_baseline_att_256x256_B_epoch_160.onnx --useDLACore=0 --allowGPUFallback --verbose

Any updates on this?

Hi @oscar.thorn,

I’ve created a pull request to torch2trt to enable DLA support.

Please note, this is subject to change. You may be able to use this to get past the issue you’re facing. The instructions for usage are documented in the pull request.

Please let me know if you try this out, or have any questions. I’m very curious to hear if this helps your use case.


@jaybdub Thanks! This seems very convenient, hope it gets merged!

But unfortunately it does not work for me. Same error:

[TensorRT] VERBOSE: Block size 65536
[TensorRT] VERBOSE: Total Activation Memory: 49823744
[TensorRT] INFO: Detected 1 inputs and 2 output network tensors.
[TensorRT] VERBOSE: 0.densenet.features.conv0 [CONVOLUTION #1, DLA] torch.nn.Conv2d.forward(Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False), tensor(shape=[1, 3, 256, 256], dtype=torch.float32)) + 0.densenet.features.relu0 [RELU #1, DLA] torch.nn.ReLU.forward(ReLU(inplace=True), tensor(shape=[1, 64, 128, 128], dtype=torch.float32)) () Set Tactic Name: volta_first_layer_filter7x7_fwd
[TensorRT] VERBOSE: Builder timing cache: created 1861 entries, 1436 hit(s)
[TensorRT] ERROR: …/builder/cudnnBuilder2.cpp (1757) - Assertion Error in operator(): 0 (et.region->getType() == RegionType::kNVM)

So maybe this is a problem in TensorRT itself? Any idea what the assertion et.region->getType() == RegionType::kNVM signifies?


This is a known issue, first reported in the topic “Problem building TensorRT engines for DLA core”.
We have already fixed this in our internal branch, and the fix will be available in the next major release.

Sorry for the inconvenience.

Thanks for the update! Good to know it is being fixed.

When can we expect a release with the fix? A rough estimate is fine. Being able to use the DLA cores is really important for the product we are building with Xavier.


Unfortunately, we are not allowed to share any release schedule here.
Let us check with our internal team to see if there is any extra information we can share.


We’re going to have the next release in the summer; please wait for our announcement.


Is this fix still on track for the next release? Will it happen in June?
I would like to use the DLAs on my Jetson NX. So far I have struggled to use them: either they do not support enough operations, or they have to interact with the CPU/GPU, which leads to low performance overall.
A bit frustrating.
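
One way to see how much of a network the DLA can take is to ask the builder config layer by layer before building. Below is a minimal sketch, assuming the TensorRT Python API's `IBuilderConfig.can_run_on_DLA`, `INetworkDefinition.num_layers` and `INetworkDefinition.get_layer`. The helper name `split_layers_by_dla_support` is hypothetical, and `network` and `config` are assumed to be a populated network and a config already set up for DLA (device type, core, and FP16 or INT8 precision).

```python
def split_layers_by_dla_support(network, config):
    """Return (on_dla, on_gpu): names of layers that can and cannot run on DLA."""
    on_dla, on_gpu = [], []
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        # can_run_on_DLA reports whether this layer is eligible for DLA
        # under the given builder config.
        if config.can_run_on_DLA(layer):
            on_dla.append(layer.name)
        else:
            on_gpu.append(layer.name)
    return on_dla, on_gpu


# Usage on a Jetson with TensorRT installed (not run here):
#   on_dla, on_gpu = split_layers_by_dla_support(network, config)
#   print("%d layers can run on DLA, %d need GPU fallback"
#         % (len(on_dla), len(on_gpu)))
```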

Yes, the next release will be available in late July 2021.
