Cannot run model exported from TLT on Jetson's DLA

Description

I am using TLT 2.0 (docker image nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3) to do transfer learning with detectnet_v2 resnet18. More precisely, I am following this tutorial: https://github.com/NVIDIA-AI-IOT/face-mask-detection. After training, I export the model with float16 precision as an .etl file.

Now I want to run the model with TensorRT on the DLA of a Jetson AGX Xavier. For that, I use tlt-converter to generate the .engine/.trt file. Because I have TensorRT 6.0, I am using this converter: https://developer.nvidia.com/tlt-converter-trt60. After that, I use trtexec to try to run inference on the DLA. Sadly, the model only appears to run on the GPU.

Environment

TensorRT Version: 6.0
GPU Type: Xavier AGX
Operating System + Version: JetPack 4.3

Steps To Reproduce

  • Exported the trained model with:
tlt-export detectnet_v2 \
            -o resnet18_detector.etl \
            -m resnet18_detector.tlt \
            -k key \
            --data_type fp16 
  • Then, on the Jetson, converted the .etl model to a TensorRT engine with:
tlt-converter -k key \
            -d "3,544,960" \
            -o "output_cov/Sigmoid,output_bbox/BiasAdd" \
            -e resnet18_detector.trt \
            -m 16 \
            -t fp16 \
            resnet18_detector.etl

However, the converter output reports that all layers run on the GPU and none on the DLA:

[INFO] 
[INFO] --------------- Layers running on DLA: 
[INFO] 
[INFO] --------------- Layers running on GPU: 
[INFO] conv1/convolution + activation_1/Relu, block_1a_conv_1/convolution + block_1a_relu_1/Relu, block_1a_conv_shortcut/convolution, block_1a_conv_2/convolution + add_1/add + block_1a_relu/Relu, block_1b_conv_1/convolution + block_1b_relu_1/Relu, block_1b_conv_2/convolution + add_2/add + block_1b_relu/Relu, block_2a_conv_1/convolution + block_2a_relu_1/Relu, block_2a_conv_shortcut/convolution, block_2a_conv_2/convolution + add_3/add + block_2a_relu/Relu, block_2b_conv_1/convolution + block_2b_relu_1/Relu, block_2b_conv_2/convolution + add_4/add + block_2b_relu/Relu, block_3a_conv_1/convolution + block_3a_relu_1/Relu, block_3a_conv_shortcut/convolution, block_3a_conv_2/convolution + add_5/add + block_3a_relu/Relu, block_3b_conv_1/convolution + block_3b_relu_1/Relu, block_3b_conv_2/convolution + add_6/add + block_3b_relu/Relu, block_4a_conv_1/convolution + block_4a_relu_1/Relu, block_4a_conv_shortcut/convolution, block_4a_conv_2/convolution + add_7/add + block_4a_relu/Relu, block_4b_conv_1/convolution + block_4b_relu_1/Relu, block_4b_conv_2/convolution + add_8/add + block_4b_relu/Relu, output_bbox/convolution, output_cov/convolution, output_cov/Sigmoid, 
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 2 output network tensors.
  • Finally I tried to run it on the DLA:
trtexec --loadEngine=resnet18_detector.trt --batch=1 --useDLACore=0 --fp16 --verbose

But it still appears to be using the GPU: jtop shows GPU load during inference, and I get exactly the same inference time when I run without --useDLACore.
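
The layer placement reported by tlt-converter above can also be checked mechanically. The following is only a rough sketch: it assumes the converter output was saved to a file (the name convert.log is made up for illustration) and that the comma-separated layer list sits on the single line right after each "Layers running on ..." header, as in the log shown above. The helper name count_layers is likewise hypothetical and not part of any NVIDIA tool.

```shell
#!/bin/sh
# Hypothetical helper: count the layers listed for a device ("DLA" or "GPU")
# in a saved tlt-converter log. Assumes the layer list is on the line
# immediately following the "Layers running on <device>:" header.
count_layers() {
    awk -v dev="$2" '
        found {
            sub(/^\[INFO\] */, "")           # drop the log prefix
            n = split($0, parts, /, */)      # layers are comma-separated
            count = 0
            for (i = 1; i <= n; i++)
                if (parts[i] != "") count++  # ignore the trailing empty field
            print count
            done = 1
            exit
        }
        index($0, "Layers running on " dev ":") { found = 1 }
        END { if (!done) print 0 }           # header or list line missing
    ' "$1"
}

# Example usage (convert.log is an assumed file name):
#   count_layers convert.log DLA
#   count_layers convert.log GPU
```

If the DLA count is 0, as in my log, no layer was assigned to the DLA when the engine was built.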

The above-mentioned tutorial showed that it is possible to run the model on the DLA. Which step am I getting wrong, and how can I make it run on the DLA?

Hi, this looks like a Jetson issue. We recommend raising it on the respective platform via the link below.

Thanks!

OK, thanks. I just created a new topic there. Maybe you can delete this topic now.