Trt_pose on DLA

oscar.thorn · February 18, 2021, 4:15pm

Hi!

I’m trying to run Trt_pose on one of the DLA cores of the Xavier NX. However, building the engine fails with:

[TensorRT] ERROR: ../builder/cudnnBuilder2.cpp (1757) - Assertion Error in operator(): 0 (et.region->getType() == RegionType::kNVM)

This is my code for building the engine (adapted from torch2trt to use DLA). I’m using the densenet121_baseline_att model:

import torch
import trt_pose.models
from torch2trt import ConversionContext
import tensorrt as trt

def default_input_names(num_inputs):
    return ["input_%d" % i for i in range(num_inputs)]

def default_output_names(num_outputs):
    return ["output_%d" % i for i in range(num_outputs)]

human_pose = {"supercategory": "person", "id": 1, "name": "person", "keypoints": ["nose", "left_eye", "right_eye", "left_ear", "right_ear", "left_shoulder", "right_shoulder", "left_elbow", "right_elbow", "left_wrist", "right_wrist", "left_hip", "right_hip", "left_knee", "right_knee", "left_ankle", "right_ankle", "neck"], "skeleton": [[16, 14], [14, 12], [17, 15], [15, 13], [12, 13], [6, 8], [7, 9], [8, 10], [9, 11], [2, 3], [1, 2], [1, 3], [2, 4], [3, 5], [4, 6], [5, 7], [18, 1], [18, 6], [18, 7], [18, 12], [18, 13]]}

if __name__=="__main__":

    inputs = (torch.zeros((1, 3, 256, 256)).cuda(),)

    num_parts = len(human_pose['keypoints'])
    num_links = len(human_pose['skeleton'])
    module = trt_pose.models.densenet121_baseline_att(num_parts, 2 * num_links).cuda().eval()
    module.load_state_dict(torch.load("./model/densenet121_baseline_att_256x256_B_epoch_160.pth"))

    logger = trt.Logger(trt.Logger.VERBOSE)
    builder = trt.Builder(logger)

    # run once to get num outputs
    outputs = module(*inputs)
    if not isinstance(outputs, tuple) and not isinstance(outputs, list):
        outputs = (outputs,)

    input_names = default_input_names(len(inputs))
    output_names = default_output_names(len(outputs))

    network = builder.create_network()
    with ConversionContext(network) as ctx:

        ctx.add_inputs(inputs, input_names)

        outputs = module(*inputs)

        if not isinstance(outputs, tuple) and not isinstance(outputs, list):
            outputs = (outputs,)
        ctx.mark_outputs(outputs, output_names)

    builder.max_batch_size = 1
    config = builder.create_builder_config()

    config.max_workspace_size = 1 << 30

    config.set_flag(trt.BuilderFlag.FP16)

    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)

    config.default_device_type = trt.DeviceType.DLA
    config.DLA_core = 0

    # profile = builder.create_optimization_profile()
    # profile.set_shape(
    #     'input_0',                          # input tensor name
    #     (1, 3, 256, 256),  # min shape
    #     (1, 3, 256, 256),  # opt shape
    #     (1, 3, 256, 256))  # max shape
    # config.add_optimization_profile(profile)

    engine = builder.build_engine(network, config)

Edit to add: I’m on JetPack 4.5.

AastaLLL · February 19, 2021, 3:36am

Hi,

Would you mind converting the model to ONNX with export_for_isaac.py,
and then building a TensorRT engine from it with trtexec?

/usr/src/tensorrt/bin/trtexec --onnx=[model] --useDLACore=0 --allowGPUFallback --verbose
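
If this step needs to be scripted, the command above can be assembled in Python. This is only a minimal sketch: the helper name `build_trtexec_cmd` and the file name `model.onnx` are hypothetical, and only the flags shown in the command above are used.

```python
import shlex

TRTEXEC = "/usr/src/tensorrt/bin/trtexec"  # path taken from the command above


def build_trtexec_cmd(onnx_path, dla_core=0, gpu_fallback=True, verbose=True):
    """Assemble the trtexec argument list for a DLA build (hypothetical helper)."""
    cmd = [TRTEXEC, f"--onnx={onnx_path}", f"--useDLACore={dla_core}"]
    if gpu_fallback:
        # Let layers the DLA cannot run fall back to the GPU.
        cmd.append("--allowGPUFallback")
    if verbose:
        cmd.append("--verbose")
    return cmd


if __name__ == "__main__":
    cmd = build_trtexec_cmd("model.onnx")  # placeholder file name
    print(" ".join(shlex.quote(part) for part in cmd))
    # On the Jetson itself the list can be passed to subprocess.run(cmd, check=True).
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess` without shell quoting issues.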

Thanks.

oscar.thorn · February 19, 2021, 8:00am

Sure, I’ll try that!

oscar.thorn · February 19, 2021, 8:50am

Converting to ONNX using the script works, but building the engine with trtexec fails in the same way as my script. These are the last few lines of the log (I can include the whole log if you need it):

[02/19/2021-09:47:25] [V] [TRT] Total Activation Memory: 33488896
[02/19/2021-09:47:25] [I] [TRT] Detected 1 inputs and 3 output network tensors.
[02/19/2021-09:47:25] [V] [TRT] Conv_4 + Relu_6 () Set Tactic Name: volta_first_layer_filter7x7_fwd
[02/19/2021-09:47:25] [V] [TRT] Builder timing cache: created 1971 entries, 4528 hit(s)
[02/19/2021-09:47:25] [E] [TRT] ../builder/cudnnBuilder2.cpp (1757) - Assertion Error in operator(): 0 (et.region->getType() == RegionType::kNVM)
[02/19/2021-09:47:25] [E] Engine creation failed
[02/19/2021-09:47:25] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=./densenet121_baseline_att_256x256_B_epoch_160.onnx --useDLACore=0 --allowGPUFallback --verbose
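
When sharing a long verbose log, it can help to pull out just the error lines. The following is a rough sketch that assumes the `[timestamp] [level] message` layout seen in the excerpt above; the function name `error_lines` is hypothetical and the sample text is copied from that excerpt.

```python
import re

# Matches trtexec-style lines such as "[02/19/2021-09:47:25] [E] [TRT] message".
LINE_RE = re.compile(r"^\[(?P<time>[^\]]+)\] \[(?P<level>[A-Z])\] (?P<msg>.*)$")


def error_lines(log_text):
    """Return the message part of every line logged at error level ([E])."""
    errors = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "E":
            errors.append(match.group("msg"))
    return errors


SAMPLE = """\
[02/19/2021-09:47:25] [V] [TRT] Builder timing cache: created 1971 entries, 4528 hit(s)
[02/19/2021-09:47:25] [E] [TRT] ../builder/cudnnBuilder2.cpp (1757) - Assertion Error in operator(): 0 (et.region->getType() == RegionType::kNVM)
[02/19/2021-09:47:25] [E] Engine creation failed
"""

for message in error_lines(SAMPLE):
    print(message)
```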

oscar.thorn · February 23, 2021, 3:04pm

Any updates on this?

jaybdub · February 25, 2021, 2:32am

Hi @oscar.thorn ,

I’ve created a pull request to torch2trt to enable DLA support.

github.com/NVIDIA-AI-IOT/torch2trt

Pull request: Dla support (NVIDIA-AI-IOT:master ← jaybdub:dla_support)

Opened 02:28AM - 25 Feb 21 UTC by jaybdub · +9144 -816

This PR enables DLA support through torch2trt. Please note this is subject to change.

### Todo

- [x] concurrent model execution profiling (DLA should free up GPU, also there are two DLAs)
- [x] 2 model GPU 0 model DLA
- [x] 1 model GPU 1 model DLA
- [x] 0 model GPU 2 model DLA
- [x] increment version to enable backtracking to legacy conversion
- [x] update documentation with DLA instructions / benchmarks
- [ ] sanity check models on imagenet data

### Basic usage

This will enable DLA usage globally by setting the default device type. If a particular layer is not supported by DLA, it will fall back to GPU unless ``gpu_fallback=False`` is set (it is True by default). Please note, either ``fp16_mode=True`` or ``int8_mode=True`` is required for DLA.

```python
import torch
from torch2trt import trt, torch2trt
from torchvision.models import resnet18

model = resnet18(pretrained=True).cuda().eval()
data = torch.randn(1, 3, 224, 224).cuda()

model_trt = torch2trt(model, [data], fp16_mode=True, default_device_type=trt.DeviceType.DLA, dla_core=0)
```

### Set device for particular modules

In some cases you may want only a particular part of the model to run on DLA. This may be useful if your model has a large block that can run on DLA, but other parts with mixed supported / unsupported components which may incur overhead cost. This is done at a module-level granularity. Each lower-level setting will override a higher-level module, but only apply to layers added within the module.

For example, here we convert the resnet18 blocks ``layer1`` and ``layer2`` and the first block in ``layer3`` to run on DLA.

```python
model_trt = torch2trt(model, [data],
    default_device_type=trt.DeviceType.GPU,
    fp16_mode=True,
    dla_core=0,
    device_types={
        model.layer1: trt.DeviceType.DLA,
        model.layer2: trt.DeviceType.DLA,
        model.layer3[0]: trt.DeviceType.DLA
    })
```

You could do the inverse of this if you wanted.

```python
model_trt = torch2trt(model, [data],
    default_device_type=trt.DeviceType.DLA,
    fp16_mode=True,
    device_types={
        model.layer1: trt.DeviceType.GPU,
        model.layer2: trt.DeviceType.GPU,
        model.layer3[0]: trt.DeviceType.GPU
    })
```

In any instance, a DLA layer will run on GPU if ``gpu_fallback=True``. If you would prefer to disable this behavior, you can set ``gpu_fallback=False``, but TensorRT optimization may fail internally.

TODO

- validate existing test cases. may be cool to log DLA supported layers https://nvidia-ai-iot.github.io/torch2trt/master/converters.html
- perform sanity checks on real world model / data

Please note, this is subject to change. You may be able to use this to get past the issue you’re facing. The instructions for usage are documented in the pull request.
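
Applied to the trt_pose model from the first post, the basic usage from the pull request might look roughly like the sketch below. The keyword arguments `default_device_type` and `dla_core` come from the PR description and may change; the function name `convert_trt_pose_for_dla` is hypothetical, and the code assumes torch, trt_pose and the PR branch of torch2trt are installed. It has not been verified on a device.

```python
# Keypoint and link counts taken from the human_pose topology in the first post.
NUM_PARTS = 18  # len(human_pose["keypoints"])
NUM_LINKS = 21  # len(human_pose["skeleton"])


def convert_trt_pose_for_dla(weights_path, dla_core=0):
    """Sketch only: convert densenet121_baseline_att with DLA as the default device."""
    # Imported lazily so the module can be loaded without the GPU stack installed.
    import torch
    import trt_pose.models
    from torch2trt import trt, torch2trt

    model = trt_pose.models.densenet121_baseline_att(NUM_PARTS, 2 * NUM_LINKS)
    model = model.cuda().eval()
    model.load_state_dict(torch.load(weights_path))

    data = torch.zeros((1, 3, 256, 256)).cuda()

    # fp16_mode (or int8_mode) is required for DLA according to the PR description.
    return torch2trt(
        model,
        [data],
        fp16_mode=True,
        default_device_type=trt.DeviceType.DLA,
        dla_core=dla_core,
    )
```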

Please let me know if you try this out, or have any questions. I’m very curious to hear if this helps your use case.

Best,
John

oscar.thorn · March 1, 2021, 9:26am

@jaybdub Thanks! This seems very convenient, and I hope it gets merged!

Unfortunately, it does not work for me. I get the same error:

[TensorRT] VERBOSE: Block size 65536
[TensorRT] VERBOSE: Total Activation Memory: 49823744
[TensorRT] INFO: Detected 1 inputs and 2 output network tensors.
[TensorRT] VERBOSE: 0.densenet.features.conv0 [CONVOLUTION #1, DLA] torch.nn.Conv2d.forward(Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False), tensor(shape=[1, 3, 256, 256], dtype=torch.float32)) + 0.densenet.features.relu0 [RELU #1, DLA] torch.nn.ReLU.forward(ReLU(inplace=True), tensor(shape=[1, 64, 128, 128], dtype=torch.float32)) () Set Tactic Name: volta_first_layer_filter7x7_fwd
[TensorRT] VERBOSE: Builder timing cache: created 1861 entries, 1436 hit(s)
[TensorRT] ERROR: ../builder/cudnnBuilder2.cpp (1757) - Assertion Error in operator(): 0 (et.region->getType() == RegionType::kNVM)

So maybe this is a problem in TensorRT itself? Any idea what the assertion et.region->getType() == RegionType::kNVM signifies?

AastaLLL · March 16, 2021, 8:11am

Hi,

This is a known issue, first reported in the topic “Problem building TensorRT engines for DLA core”.
We have already fixed it in our internal branch, and the fix will be available in the next major release.

Sorry for the inconvenience.
Thanks.

oscar.thorn · March 16, 2021, 10:19am

Thanks for the update! Good to know it is being fixed.

When can we expect a release with the fix? A rough estimate is fine. Being able to use the DLA cores is really important for the product we are building with Xavier.

AastaLLL · March 16, 2021, 11:07am

Hi,

Unfortunately, we are not allowed to share any release schedule here.
Let us check with our internal team to see if there is any extra information we can share.

Thanks.

kayccc · March 17, 2021, 7:14am

We’re going to have the next release in the summer. Please wait for our announcement.

Thanks

tetsfr · June 8, 2021, 11:47am

Hi,
Is this fix still on track for the next release? Will it eventually happen in June?
I would like to use the DLAs on my Jetson NX. So far I have struggled to use them: they either do not support enough operations or have to interact with the CPU/GPU, which leads to low performance overall.
A bit frustrating.
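
To see how many operations the DLA can actually take, the builder config can be queried layer by layer before building. The following is a minimal sketch, assuming `network` is a TensorRT `INetworkDefinition` and `config` is an `IBuilderConfig` already set up for DLA (FP16 or INT8 enabled and a DLA core selected, as in the script in the first post). The function name `dla_support_report` is hypothetical.

```python
def dla_support_report(network, config):
    """Split layer names into those the DLA can run and those needing GPU fallback."""
    on_dla, on_gpu = [], []
    for index in range(network.num_layers):
        layer = network.get_layer(index)
        # can_run_on_DLA reports whether this layer is eligible for the DLA
        # under the current builder configuration.
        if config.can_run_on_DLA(layer):
            on_dla.append(layer.name)
        else:
            on_gpu.append(layer.name)
    return on_dla, on_gpu


def print_report(network, config):
    """Print a short summary of the split returned by dla_support_report."""
    on_dla, on_gpu = dla_support_report(network, config)
    print(f"{len(on_dla)} layer(s) can run on DLA, {len(on_gpu)} need GPU fallback")
    for name in on_gpu:
        print("  GPU fallback:", name)
```

A long run of alternating DLA and GPU layers in the report usually means many transitions between the two devices, which is one source of the overhead described above.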

kayccc · June 16, 2021, 5:16am

Yes, the next release will be available in late July 2021.

