Trt_pose on DLA

Hi!

I’m trying to run Trt_pose on one of the DLA cores of the Xavier NX. However, building the engine fails with:

[TensorRT] ERROR: ../builder/cudnnBuilder2.cpp (1757) - Assertion Error in operator(): 0 (et.region->getType() == RegionType::kNVM)

This is my code for building the engine, adapted from torch2trt to use DLA. I’m using the densenet121_baseline_att model:

import torch
import trt_pose.models
from torch2trt import ConversionContext
import tensorrt as trt

def default_input_names(num_inputs):
    return ["input_%d" % i for i in range(num_inputs)]

def default_output_names(num_outputs):
    return ["output_%d" % i for i in range(num_outputs)]

human_pose = {"supercategory": "person", "id": 1, "name": "person", "keypoints": ["nose", "left_eye", "right_eye", "left_ear", "right_ear", "left_shoulder", "right_shoulder", "left_elbow", "right_elbow", "left_wrist", "right_wrist", "left_hip", "right_hip", "left_knee", "right_knee", "left_ankle", "right_ankle", "neck"], "skeleton": [[16, 14], [14, 12], [17, 15], [15, 13], [12, 13], [6, 8], [7, 9], [8, 10], [9, 11], [2, 3], [1, 2], [1, 3], [2, 4], [3, 5], [4, 6], [5, 7], [18, 1], [18, 6], [18, 7], [18, 12], [18, 13]]}

if __name__ == "__main__":

    inputs = (torch.zeros((1, 3, 256, 256)).cuda(),)

    num_parts = len(human_pose['keypoints'])
    num_links = len(human_pose['skeleton'])
    module = trt_pose.models.densenet121_baseline_att(num_parts, 2 * num_links).cuda().eval()
    module.load_state_dict(torch.load("./model/densenet121_baseline_att_256x256_B_epoch_160.pth"))

    logger = trt.Logger(trt.Logger.VERBOSE)
    builder = trt.Builder(logger)

    # run once to get num outputs
    outputs = module(*inputs)
    if not isinstance(outputs, (tuple, list)):
        outputs = (outputs,)

    input_names = default_input_names(len(inputs))
    output_names = default_output_names(len(outputs))

    network = builder.create_network()
    with ConversionContext(network) as ctx:

        ctx.add_inputs(inputs, input_names)

        outputs = module(*inputs)

        if not isinstance(outputs, (tuple, list)):
            outputs = (outputs,)
        ctx.mark_outputs(outputs, output_names)

    builder.max_batch_size = 1
    config = builder.create_builder_config()

    config.max_workspace_size = 1 << 30

    config.set_flag(trt.BuilderFlag.FP16)

    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)

    config.default_device_type = trt.DeviceType.DLA
    config.DLA_core = 0

    # profile = builder.create_optimization_profile()
    # profile.set_shape(
    #     'input_0',                          # input tensor name
    #     (1, 3, 256, 256),  # min shape
    #     (1, 3, 256, 256),  # opt shape
    #     (1, 3, 256, 256))  # max shape
    # config.add_optimization_profile(profile)

    engine = builder.build_engine(network, config)

Edit to add: I’m on Jetpack 4.5
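
Since the script makes DLA the default device and relies on GPU fallback, it may help to see which layers TensorRT itself reports as DLA-capable before calling build_engine. The following is only a sketch: summarize_dla_support is a hypothetical helper name, and it assumes the network and config objects from the script above (it uses num_layers, get_layer and can_run_on_DLA from the TensorRT Python API). It does not explain the assertion, it only narrows down where to look.

```python
def summarize_dla_support(network, config):
    """Group layers by whether the builder config reports them as DLA-capable.

    `network` is expected to behave like trt.INetworkDefinition and
    `config` like trt.IBuilderConfig (hypothetical helper, not part of
    torch2trt or trt_pose).
    """
    report = {"dla": [], "gpu_fallback": []}
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        # can_run_on_DLA asks TensorRT whether this layer is supported on DLA
        key = "dla" if config.can_run_on_DLA(layer) else "gpu_fallback"
        report[key].append((layer.name, str(layer.type)))
    return report
```

Called right before builder.build_engine(network, config), the "gpu_fallback" list shows the layers that would need GPU_FALLBACK to build at all.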

Hi,

Would you mind converting the model to ONNX with export_for_isaac.py and then building a TensorRT engine from it with trtexec?

/usr/src/tensorrt/bin/trtexec --onnx=[model] --useDLACore=0 --allowGPUFallback --verbose
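
If it is easier to script this, for example to try both DLA cores, here is a minimal sketch that assembles the same command line. The function name trtexec_dla_command and the optional fp16 switch are my additions for illustration (the --fp16 flag is not part of the command above), and the trtexec path is the one shown above.

```python
import shlex

TRTEXEC = "/usr/src/tensorrt/bin/trtexec"

def trtexec_dla_command(onnx_path, dla_core=0, fp16=False):
    """Return the trtexec argument list for a DLA build with GPU fallback."""
    if dla_core < 0:
        raise ValueError("dla_core must be non-negative")
    cmd = [
        TRTEXEC,
        f"--onnx={onnx_path}",
        f"--useDLACore={dla_core}",
        "--allowGPUFallback",
        "--verbose",
    ]
    if fp16:
        # Assumption: optional extra flag, not used in the command above
        cmd.append("--fp16")
    return cmd

def as_shell(cmd):
    """Render an argument list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(arg) for arg in cmd)
```

The returned list can be passed directly to subprocess.run, and as_shell(trtexec_dla_command("model.onnx", dla_core=1)) gives the equivalent command for the second core.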

Thanks.

Sure, I’ll try that!

Converting to ONNX with the script works, but building the engine with trtexec fails in the same way as my script. These are the last few lines (I can include the whole log if you need it):

[02/19/2021-09:47:25] [V] [TRT] Total Activation Memory: 33488896
[02/19/2021-09:47:25] [I] [TRT] Detected 1 inputs and 3 output network tensors.
[02/19/2021-09:47:25] [V] [TRT] Conv_4 + Relu_6 () Set Tactic Name: volta_first_layer_filter7x7_fwd
[02/19/2021-09:47:25] [V] [TRT] Builder timing cache: created 1971 entries, 4528 hit(s)
[02/19/2021-09:47:25] [E] [TRT] ../builder/cudnnBuilder2.cpp (1757) - Assertion Error in operator(): 0 (et.region->getType() == RegionType::kNVM)
[02/19/2021-09:47:25] [E] Engine creation failed
[02/19/2021-09:47:25] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=./densenet121_baseline_att_256x256_B_epoch_160.onnx --useDLACore=0 --allowGPUFallback --verbose
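
Because the verbose log is long, here is a small sketch of how the excerpt above could be cut out automatically: it keeps every error line plus a few lines before it. The helper name error_context is made up for this post, and it assumes errors are tagged either [E] (trtexec) or ERROR (the Python logger), as in the logs in this thread.

```python
import re

# Matches the "[E]" tag used by trtexec and the "ERROR" tag of the Python logger
ERROR_TAG = re.compile(r"\[E\]|\bERROR\b")

def error_context(log_text, before=3):
    """Return error lines together with the `before` lines preceding each."""
    lines = log_text.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if ERROR_TAG.search(line):
            keep.update(range(max(0, i - before), i + 1))
    # Preserve the original order and avoid duplicates for adjacent errors
    return [lines[i] for i in sorted(keep)]
```

For example, error_context(open("trtexec.log").read()) would return the last few lines before each failure, where trtexec.log is a placeholder for wherever the output was saved.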

Any updates on this?

Hi @oscar.thorn ,

I’ve created a pull request to torch2trt to enable DLA support.

Please note, this is subject to change. You may be able to use this to get past the issue you’re facing. The instructions for usage are documented in the pull request.

Please let me know if you try this out, or have any questions. I’m very curious to hear if this helps your use case.

Best,
John

@jaybdub Thanks! This seems very convenient, hope it gets merged!

Unfortunately, it does not work for me. I get the same error:

[TensorRT] VERBOSE: Block size 65536
[TensorRT] VERBOSE: Total Activation Memory: 49823744
[TensorRT] INFO: Detected 1 inputs and 2 output network tensors.
[TensorRT] VERBOSE: 0.densenet.features.conv0 [CONVOLUTION #1, DLA] torch.nn.Conv2d.forward(Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False), tensor(shape=[1, 3, 256, 256], dtype=torch.float32)) + 0.densenet.features.relu0 [RELU #1, DLA] torch.nn.ReLU.forward(ReLU(inplace=True), tensor(shape=[1, 64, 128, 128], dtype=torch.float32)) () Set Tactic Name: volta_first_layer_filter7x7_fwd
[TensorRT] VERBOSE: Builder timing cache: created 1861 entries, 1436 hit(s)
[TensorRT] ERROR: ../builder/cudnnBuilder2.cpp (1757) - Assertion Error in operator(): 0 (et.region->getType() == RegionType::kNVM)

So maybe this is a problem in TensorRT itself? Any idea what the assertion et.region->getType() == RegionType::kNVM signifies?