Description
I am using TLT 2.0 (docker image nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3) for transfer learning with DetectNet_v2 (ResNet-18). More precisely, I am following this tutorial: NVIDIA-AI-IOT/face-mask-detection (Face Mask Detection using NVIDIA Transfer Learning Toolkit (TLT) and DeepStream for COVID-19). After training, I export the model with FP16 precision as a .etl file.
Now I want to run the model with TensorRT on a Jetson AGX Xavier, on the DLA. For that I use tlt-converter to generate the .engine/.trt file; since I have TensorRT 6.0, I am using this converter: https://developer.nvidia.com/tlt-converter-trt60. After that I use trtexec to try to run inference on the DLA. Unfortunately, the model only appears to run on the GPU.
Environment
TensorRT Version: 6.0
GPU Type: Jetson AGX Xavier (integrated GPU)
Operating System + Version: JetPack 4.3
Steps To Reproduce
- Exported the trained model with:
tlt-export detectnet_v2 \
-o resnet18_detector.etl \
-m resnet18_detector.tlt \
-k key \
--data_type fp16
- Then, on the Jetson, converted the .etl model to a TensorRT engine with:
tlt-converter -k key \
-d "3,544,960" \
-o "output_cov/Sigmoid,output_bbox/BiasAdd" \
-e resnet18_detector.trt \
-m 16 \
-t fp16 \
resnet18_detector.etl
But the converter log shows that all layers were placed on the GPU (see also the note on a possible DLA flag right after the log):
[INFO]
[INFO] --------------- Layers running on DLA:
[INFO]
[INFO] --------------- Layers running on GPU:
[INFO] conv1/convolution + activation_1/Relu, block_1a_conv_1/convolution + block_1a_relu_1/Relu, block_1a_conv_shortcut/convolution, block_1a_conv_2/convolution + add_1/add + block_1a_relu/Relu, block_1b_conv_1/convolution + block_1b_relu_1/Relu, block_1b_conv_2/convolution + add_2/add + block_1b_relu/Relu, block_2a_conv_1/convolution + block_2a_relu_1/Relu, block_2a_conv_shortcut/convolution, block_2a_conv_2/convolution + add_3/add + block_2a_relu/Relu, block_2b_conv_1/convolution + block_2b_relu_1/Relu, block_2b_conv_2/convolution + add_4/add + block_2b_relu/Relu, block_3a_conv_1/convolution + block_3a_relu_1/Relu, block_3a_conv_shortcut/convolution, block_3a_conv_2/convolution + add_5/add + block_3a_relu/Relu, block_3b_conv_1/convolution + block_3b_relu_1/Relu, block_3b_conv_2/convolution + add_6/add + block_3b_relu/Relu, block_4a_conv_1/convolution + block_4a_relu_1/Relu, block_4a_conv_shortcut/convolution, block_4a_conv_2/convolution + add_7/add + block_4a_relu/Relu, block_4b_conv_1/convolution + block_4b_relu_1/Relu, block_4b_conv_2/convolution + add_8/add + block_4b_relu/Relu, output_bbox/convolution, output_cov/convolution, output_cov/Sigmoid,
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 2 output network tensors.
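From the documentation of newer tlt-converter releases (the TensorRT 7.x builds), there is a -u <dla_core> option to target a DLA core when the engine is built. I am not sure whether the TRT 6.0 build I am using supports it; a sketch of what I would expect to need, assuming the flag exists in this build:
# -u 0 is an assumption: I have not confirmed that the TRT 6.0 tlt-converter exposes this flag
tlt-converter -k key \
-d "3,544,960" \
-o "output_cov/Sigmoid,output_bbox/BiasAdd" \
-e resnet18_detector.trt \
-m 16 \
-t fp16 \
-u 0 \
resnet18_detector.etl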
- Finally, I tried to run the engine built above on the DLA:
trtexec --loadEngine=resnet18_detector.trt --batch=1 --useDLACore=0 --fp16 --verbose
But it appears to be using the GPU (checked via the GPU load in jtop), and running the same command without --useDLACore gives exactly the same inference time (a concrete version of this check is sketched below).
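For completeness, this is roughly the comparison behind that observation, with tegrastats (shipped with JetPack) as an alternative to jtop for watching GPU load via GR3D_FREQ; the trtexec flags are the same ones used above:
# terminal 1: watch GPU load while inference runs (GR3D_FREQ should stay low if the DLA is doing the work)
sudo tegrastats
# terminal 2: run the same engine with and without the DLA flag and compare timings
trtexec --loadEngine=resnet18_detector.trt --batch=1 --useDLACore=0 --fp16
trtexec --loadEngine=resnet18_detector.trt --batch=1 --fp16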
The tutorial mentioned above shows that it is possible to run this model on the DLA. Which step am I getting wrong, and how can I make it run on the DLA?