Failed to alloc CUDA mapped memory for tensor input when loading tiny_yolov2.onnx

Hi, all,
I am having difficulty converting my custom-trained tiny_yolov2 model into TensorRT format, so I plan to first test the stock tiny_yolov2.onnx model from
https://github.com/onnx/models/tree/master/vision/object_detection_segmentation/tiny_yolov2

I created a "test" folder at "jetson-inference/data/networks/test/" and put the ONNX model inside together with the labels file.
Then, from the folder "/jetson-inference/python/examples", I ran the following commands:

$ NET=/home/***/jetson-inference/data/networks/test
$ python3 detectnet-console.py --model=$NET/Model.onnx --label=$NET/labels.txt /home/***/jetson-inference/data/images/peds_1.jpg /home/***/output.jpg --input_blob=data --output_bbox=bboxes

But I get the following error:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
jetson.inference.__init__.py
jetson.inference -- initializing Python 3.6 bindings...
jetson.inference -- registering module types...
jetson.inference -- done registering module types
jetson.inference -- done Python 3.6 binding initialization
jetson.utils.__init__.py
jetson.utils -- initializing Python 3.6 bindings...
jetson.utils -- registering module functions...
jetson.utils -- done registering module functions
jetson.utils -- registering module types...
jetson.utils -- done registering module types
jetson.utils -- done Python 3.6 binding initialization
[image] loaded '/home/***/jetson-inference/data/images/peds_1.jpg' (1920 x 1080, 3 channels)
jetson.inference -- PyTensorNet_New()
jetson.inference -- PyDetectNet_Init()
jetson.inference -- detectNet loading network using argv command line params
jetson.inference -- detectNet.init() argv[0] = 'detectnet-console.py'
jetson.inference -- detectNet.init() argv[1] = '--model=/home/***/jetson-inference/data/networks/test/Model.onnx'
jetson.inference -- detectNet.init() argv[2] = '--label=/home/***/jetson-inference/data/networks/test/labels.txt'
jetson.inference -- detectNet.init() argv[3] = '/home/***/jetson-inference/data/images/peds_1.jpg'
jetson.inference -- detectNet.init() argv[4] = '/home/***/output.jpg'
jetson.inference -- detectNet.init() argv[5] = '--input_blob=data'
jetson.inference -- detectNet.init() argv[6] = '--output_bbox=bboxes'

detectNet -- loading detection network model from:
-- prototxt NULL
-- model /home/***/jetson-inference/data/networks/test/Model.onnx
-- input_blob 'data'
-- output_cvg 'coverage'
-- output_bbox 'bboxes'
-- mean_pixel 0.000000
-- mean_binary NULL
-- class_labels NULL
-- threshold 0.500000
-- batch_size 1

[TRT] TensorRT version 5.1.6
[TRT] loading NVIDIA plugins...
[TRT] Plugin Creator registration succeeded - GridAnchor_TRT
[TRT] Plugin Creator registration succeeded - NMS_TRT
[TRT] Plugin Creator registration succeeded - Reorg_TRT
[TRT] Plugin Creator registration succeeded - Region_TRT
[TRT] Plugin Creator registration succeeded - Clip_TRT
[TRT] Plugin Creator registration succeeded - LReLU_TRT
[TRT] Plugin Creator registration succeeded - PriorBox_TRT
[TRT] Plugin Creator registration succeeded - Normalize_TRT
[TRT] Plugin Creator registration succeeded - RPROI_TRT
[TRT] Plugin Creator registration succeeded - BatchedNMS_TRT
[TRT] completed loading NVIDIA plugins.
[TRT] detected model format - ONNX (extension '.onnx')
[TRT] desired precision specified for GPU: FASTEST
[TRT] requested fasted precision for device GPU without providing valid calibrator, disabling INT8
[TRT] native precisions detected for GPU: FP32, FP16
[TRT] selecting fastest native precision for GPU: FP16
[TRT] attempting to open engine cache file /home/***/jetson-inference/data/networks/test/Model.onnx.1.1.GPU.FP16.engine
[TRT] cache file not found, profiling network model on device GPU
[TRT] device GPU, loading /usr/bin/ /home/***/jetson-inference/data/networks/test/Model.onnx

Input filename: /home/***/jetson-inference/data/networks/test/Model.onnx
ONNX IR version: 0.0.5
Opset version: 8
Producer name: OnnxMLTools
Producer version: 1.5.2
Domain: onnxconverter-common
Model version: 0
Doc string: The Tiny YOLO network from the paper 'YOLO9000: Better, Faster, Stronger' (2016), arXiv:1612.08242

WARNING: ONNX model has a newer ir_version (0.0.5) than this parser was built against (0.0.3).
[TRT] scalerPreprocessor_scaled:Mul -> (3, 416, 416)
[TRT] image2:Add -> (3, 416, 416)
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:771: Convolution input dimensions: (3, 416, 416)
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:835: Using kernel: (3, 3), strides: (1, 1), padding: (0, 0), dilations: (1, 1), numOutputs: 16
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:836: Convolution output dimensions: (16, 416, 416)
[TRT] convolution2d_1_output:Conv -> (16, 416, 416)
[TRT] batchnormalization_1_output:BatchNormalization -> (16, 416, 416)
[TRT] leakyrelu_1_output:LeakyRelu -> (16, 416, 416)
[TRT] maxpooling2d_1_output:MaxPool -> (16, 208, 208)
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:771: Convolution input dimensions: (16, 208, 208)
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:835: Using kernel: (3, 3), strides: (1, 1), padding: (0, 0), dilations: (1, 1), numOutputs: 32
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:836: Convolution output dimensions: (32, 208, 208)
[TRT] convolution2d_2_output:Conv -> (32, 208, 208)
[TRT] batchnormalization_2_output:BatchNormalization -> (32, 208, 208)
[TRT] leakyrelu_2_output:LeakyRelu -> (32, 208, 208)
[TRT] maxpooling2d_2_output:MaxPool -> (32, 104, 104)
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:771: Convolution input dimensions: (32, 104, 104)
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:835: Using kernel: (3, 3), strides: (1, 1), padding: (0, 0), dilations: (1, 1), numOutputs: 64
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:836: Convolution output dimensions: (64, 104, 104)
[TRT] convolution2d_3_output:Conv -> (64, 104, 104)
[TRT] batchnormalization_3_output:BatchNormalization -> (64, 104, 104)
[TRT] leakyrelu_3_output:LeakyRelu -> (64, 104, 104)
[TRT] maxpooling2d_3_output:MaxPool -> (64, 52, 52)
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:771: Convolution input dimensions: (64, 52, 52)
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:835: Using kernel: (3, 3), strides: (1, 1), padding: (0, 0), dilations: (1, 1), numOutputs: 128
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:836: Convolution output dimensions: (128, 52, 52)
[TRT] convolution2d_4_output:Conv -> (128, 52, 52)
[TRT] batchnormalization_4_output:BatchNormalization -> (128, 52, 52)
[TRT] leakyrelu_4_output:LeakyRelu -> (128, 52, 52)
[TRT] maxpooling2d_4_output:MaxPool -> (128, 26, 26)
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:771: Convolution input dimensions: (128, 26, 26)
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:835: Using kernel: (3, 3), strides: (1, 1), padding: (0, 0), dilations: (1, 1), numOutputs: 256
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:836: Convolution output dimensions: (256, 26, 26)
[TRT] convolution2d_5_output:Conv -> (256, 26, 26)
[TRT] batchnormalization_5_output:BatchNormalization -> (256, 26, 26)
[TRT] leakyrelu_5_output:LeakyRelu -> (256, 26, 26)
[TRT] maxpooling2d_5_output:MaxPool -> (256, 13, 13)
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:771: Convolution input dimensions: (256, 13, 13)
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:835: Using kernel: (3, 3), strides: (1, 1), padding: (0, 0), dilations: (1, 1), numOutputs: 512
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:836: Convolution output dimensions: (512, 13, 13)
[TRT] convolution2d_6_output:Conv -> (512, 13, 13)
[TRT] batchnormalization_6_output:BatchNormalization -> (512, 13, 13)
[TRT] leakyrelu_6_output:LeakyRelu -> (512, 13, 13)
[TRT] maxpooling2d_6_output:MaxPool -> (512, 13, 13)
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:771: Convolution input dimensions: (512, 13, 13)
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:835: Using kernel: (3, 3), strides: (1, 1), padding: (0, 0), dilations: (1, 1), numOutputs: 1024
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:836: Convolution output dimensions: (1024, 13, 13)
[TRT] convolution2d_7_output:Conv -> (1024, 13, 13)
[TRT] batchnormalization_7_output:BatchNormalization -> (1024, 13, 13)
[TRT] leakyrelu_7_output:LeakyRelu -> (1024, 13, 13)
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:771: Convolution input dimensions: (1024, 13, 13)
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:835: Using kernel: (3, 3), strides: (1, 1), padding: (0, 0), dilations: (1, 1), numOutputs: 1024
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:836: Convolution output dimensions: (1024, 13, 13)
[TRT] convolution2d_8_output:Conv -> (1024, 13, 13)
[TRT] batchnormalization_8_output:BatchNormalization -> (1024, 13, 13)
[TRT] leakyrelu_8_output:LeakyRelu -> (1024, 13, 13)
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:771: Convolution input dimensions: (1024, 13, 13)
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:835: Using kernel: (1, 1), strides: (1, 1), padding: (0, 0), dilations: (1, 1), numOutputs: 125
[TRT] /home/erisuser/p4sw/sw/gpgpu/MachineLearning/DIT/release/5.1/parsers/onnxOpenSource/builtin_op_importers.cpp:836: Convolution output dimensions: (125, 13, 13)
[TRT] grid:Conv -> (125, 13, 13)
[TRT] retrieved Input tensor "image": 3x416x416
[TRT] device GPU, configuring CUDA engine
[TRT] device GPU, building FP16: ON
[TRT] device GPU, building INT8: OFF
[TRT] device GPU, building CUDA engine (this may take a few minutes the first time a network is loaded)
[TRT] device GPU, completed building CUDA engine
[TRT] network profiling complete, writing engine cache to /home/***/jetson-inference/data/networks/test/Model.onnx.1.1.GPU.FP16.engine
[TRT] device GPU, completed writing engine cache to /home/***/jetson-inference/data/networks/test/Model.onnx.1.1.GPU.FP16.engine
[TRT] device GPU, /home/***/jetson-inference/data/networks/test/Model.onnx loaded
[TRT] device GPU, CUDA engine context initialized with 2 bindings
[TRT] binding -- index 0
-- name 'image'
-- type FP32
-- in/out INPUT
-- # dims 3
-- dim #0 3 (CHANNEL)
-- dim #1 416 (SPATIAL)
-- dim #2 416 (SPATIAL)
[TRT] binding -- index 1
-- name 'grid'
-- type FP32
-- in/out OUTPUT
-- # dims 3
-- dim #0 125 (CHANNEL)
-- dim #1 13 (SPATIAL)
-- dim #2 13 (SPATIAL)
[TRT] binding to input 0 data binding index: -1
[TRT] binding to input 0 data dims (b=1 c=0 h=0 w=0) size=0
[TRT] failed to alloc CUDA mapped memory for tensor input, 0 bytes
detectNet -- failed to initialize.
jetson.inference -- detectNet failed to load built-in network 'ssd-mobilenet-v2'
PyTensorNet_Dealloc()
Traceback (most recent call last):
File "detectnet-console.py", line 51, in <module>
net = jetson.inference.detectNet(opt.network, sys.argv, opt.threshold)
Exception: jetson.inference -- detectNet failed to load network
jetson.utils -- freeing CUDA mapped memory
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++

I flashed the Jetson TX2 with the latest SDK Manager, i.e. "sdkmanager_0.9.14-4964_amd64.deb".
Jetson: Jetson TX2, Pascal GPU with 256 CUDA cores; 64-bit NVIDIA Denver and ARM Cortex-A57 CPUs; 8 GB LPDDR4 memory; 32 GB eMMC 5.1 flash storage; graphics: NVIDIA Tegra X2 (nvgpu), integrated.

OS: Ubuntu 18.04 LTS, 64-bit

TensorRT version: 5.1.6.1-1+cuda10.0

Python version: Python 3.6.8

Please help me figure out this issue.
Thank you!

Moved to Jetson TX2 forum. Someone here should be able to help you.

Thanks,
NVIDIA Enterprise Support

Hi tairen, it appears from this part of the log that the correct names of the input/output layers should be --input_blob=image --output_bbox=grid:

[TRT] binding -- index 0
-- name 'image'
-- type FP32
-- in/out INPUT
-- # dims 3
-- dim #0 3 (CHANNEL)
-- dim #1 416 (SPATIAL)
-- dim #2 416 (SPATIAL)
[TRT] binding -- index 1
-- name 'grid'
-- type FP32
-- in/out OUTPUT
-- # dims 3
-- dim #0 125 (CHANNEL)
-- dim #1 13 (SPATIAL)
-- dim #2 13 (SPATIAL)
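
If you want to confirm those tensor names straight from the model file rather than from the TensorRT log, a quick check with the onnx Python package works (assuming you have it installed with pip; this is just a sketch, not part of jetson-inference):

import onnx

# Print the graph's real input/output tensor names.
# Path assumed from the original post ('***' = redacted username).
model = onnx.load("/home/***/jetson-inference/data/networks/test/Model.onnx")

# Older exporters also list weight initializers under graph.input,
# so filter those out to leave only the true network inputs.
weights = {t.name for t in model.graph.initializer}
print("inputs: ", [i.name for i in model.graph.input if i.name not in weights])
print("outputs:", [o.name for o in model.graph.output])

With those names, your original command would become:

$ python3 detectnet-console.py --model=$NET/Model.onnx --label=$NET/labels.txt /home/***/jetson-inference/data/images/peds_1.jpg /home/***/output.jpg --input_blob=image --output_bbox=grid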

That may allow you to load the network at least. However, I have not implemented the pre/post-processing for this YOLO ONNX model. That code would go here in the detectNet.cpp source code:

The ONNX pre/post-processing code that is there now was just a test I was doing with a simpler regression model from PyTorch, as I have not been able to get a full detection model working through ONNX with PyTorch yet. However, I am interested in testing this YOLO ONNX model you pointed to, since TensorRT is able to parse it. It would probably be a couple of weeks before I'm able to work on it though, sorry. In the meantime, you are welcome to modify the code.
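
For reference, the missing post-processing amounts to decoding that 125x13x13 'grid' tensor: 5 anchor boxes per cell, each carrying 4 box coordinates, an objectness score, and 20 VOC class scores (5 x 25 = 125 channels). Below is a minimal NumPy sketch of that decoding, assuming the standard tiny-YOLOv2 VOC anchors; it is only an illustration of the math, not the code that would eventually go into detectNet.cpp:

import numpy as np

# Standard tiny-YOLOv2 VOC anchor sizes, in units of 13x13 grid cells (assumed).
ANCHORS = [(1.08, 1.19), (3.42, 4.41), (6.63, 11.38), (9.42, 5.11), (16.62, 10.52)]
NUM_CLASSES = 20
CELL = 416 // 13  # pixels per grid cell

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_grid(grid, conf_thresh=0.5):
    """grid: float32 array of shape (125, 13, 13) from the 'grid' output binding.
    Returns a list of (x1, y1, x2, y2, class_id, confidence) in 416x416 pixels."""
    grid = grid.reshape(len(ANCHORS), 5 + NUM_CLASSES, 13, 13)
    detections = []
    for a, (aw, ah) in enumerate(ANCHORS):
        for cy in range(13):
            for cx in range(13):
                tx, ty, tw, th, to = grid[a, :5, cy, cx]
                # softmax over the 20 class scores
                scores = grid[a, 5:, cy, cx]
                probs = np.exp(scores - scores.max())
                probs /= probs.sum()
                cls = int(probs.argmax())
                conf = sigmoid(to) * probs[cls]
                if conf < conf_thresh:
                    continue
                # box center/size: cell offset + sigmoid(t), anchor size * exp(t)
                x = (cx + sigmoid(tx)) * CELL
                y = (cy + sigmoid(ty)) * CELL
                w = np.exp(tw) * aw * CELL
                h = np.exp(th) * ah * CELL
                detections.append((x - w/2, y - h/2, x + w/2, y + h/2, cls, conf))
    return detections

A real implementation would also apply non-maximum suppression and scale the boxes back to the source image resolution.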