Please provide the following information when requesting support.
• Hardware: training on a T4 (AWS g4dn instance), inference on a Jetson Xavier NX
• Network Type: Yolo_v4_tiny (LPDNet)
• TLT Version: toolkit_version: 6.0.0, published_date: 07/11/2025; I also used v4.0.1 for training, as proposed in this post ( Key used to load the model is incorrect - #2 by Morganh )
• Training spec file attached
yolo_v4_tiny_train_kitti.txt (2.0 KB)
I retrained the LPDNet model on my dataset with the TAO Toolkit and got an .etlt model.
Then I converted the .etlt model to a TensorRT .engine with tao-converter on my Jetson Xavier NX, using this command:
./tao-converter yolov4_cspdarknet_tiny_epoch_070.etlt -k nvidia_tlt -d 3,480,640 -p Input,1x3x480x640,8x3x480x640,16x3x480x640, -c cal.bin -b 1 -m 16 -t int8 -o 'output_bbox/BiasAdd','output_cov/Sigmoid' -e yolov4_tiny_lpdnet_elevated_b16_int8.engine
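For reference, I can dump the bindings of the resulting engine directly on the Jetson with a short script (a minimal sketch, assuming the standard TensorRT Python API that ships with JetPack; the engine filename is the one produced above):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("yolov4_tiny_lpdnet_elevated_b16_int8.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# print every binding with its I/O direction and shape (-1 marks a dynamic dimension)
for i in range(engine.num_bindings):
    kind = "INPUT " if engine.binding_is_input(i) else "OUTPUT"
    print(i, kind, engine.get_binding_name(i), engine.get_binding_shape(i))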
However, when I run inference with this retrained model, I get an error because of a wrong input binding size:
[TRT] CUDA engine context initialized on device GPU:
[TRT] -- layers 96
[TRT] -- maxBatchSize 1
[TRT] -- deviceMemory 100868608
[TRT] -- bindings 5
[TRT] binding 0
-- index 0
-- name 'Input'
-- type FP32
-- in/out INPUT
-- # dims 4
-- dim #0 -1
-- dim #1 3
-- dim #2 480
-- dim #3 640
[TRT] binding 1
-- index 1
-- name 'BatchedNMS'
-- type INT32
-- in/out OUTPUT
-- # dims 2
-- dim #0 -1
-- dim #1 1
[TRT] binding 2
-- index 2
-- name 'BatchedNMS_1'
-- type FP32
-- in/out OUTPUT
-- # dims 3
-- dim #0 -1
-- dim #1 200
-- dim #2 4
[TRT] binding 3
-- index 3
-- name 'BatchedNMS_2'
-- type FP32
-- in/out OUTPUT
-- # dims 2
-- dim #0 -1
-- dim #1 200
[TRT] binding 4
-- index 4
-- name 'BatchedNMS_3'
-- type FP32
-- in/out OUTPUT
-- # dims 2
-- dim #0 -1
-- dim #1 200
[TRT]
[TRT] binding to input 0 Input binding index: 0
[TRT] binding to input 0 Input dims (b=1 c=4294967295 h=3 w=480) size=18446744073705865216
[cuda] cudaMalloc((void**)&inputCUDA, inputSize)
[cuda] out of memory (error 2) (hex 0x02)
[cuda] /home/artem/Projects/jetson-inference/c/tensorNet.cpp:1583
[TRT] failed to alloc CUDA device memory for tensor input, 18446744073705865216 bytes
[TRT] device GPU, failed to create resources for CUDA engine
[TRT] failed to load yolov4-tiny_elevated/yolov4_tiny_lpdnet_elevated_b16_int8.engine
[TRT] detectNet -- failed to initialize.
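The numbers in that failure suggest the dynamic batch dimension (-1) of the new engine is being read as an unsigned value and shifted into the channel slot. A quick arithmetic check (plain Python, just reproducing the sizes from the log above) gives exactly the values reported:

print(2**32 - 1)                       # 4294967295 -> the bogus "c=" value (-1 read as unsigned 32-bit)
signed_size = -1 * 3 * 480 * 640 * 4   # FP32 input volume with the batch dim kept as -1: -3,686,400 bytes
print(signed_size % 2**64)             # 18446744073705865216 -> the failed cudaMalloc size (wrapped to unsigned 64-bit)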
For comparison, below is the TRT output of the original LPDNet model:
[TRT] CUDA engine context initialized on device GPU:
[TRT] -- layers 4
[TRT] -- maxBatchSize 16
[TRT] -- deviceMemory 40550400
[TRT] -- bindings 3
[TRT] binding 0
-- index 0
-- name 'input_1'
-- type FP32
-- in/out INPUT
-- # dims 3
-- dim #0 3
-- dim #1 480
-- dim #2 640
[TRT] binding 1
-- index 1
-- name 'output_bbox/BiasAdd'
-- type FP32
-- in/out OUTPUT
-- # dims 3
-- dim #0 4
-- dim #1 30
-- dim #2 40
[TRT] binding 2
-- index 2
-- name 'output_cov/Sigmoid'
-- type FP32
-- in/out OUTPUT
-- # dims 3
-- dim #0 1
-- dim #1 30
-- dim #2 40
[TRT]
[TRT] binding to input 0 input_1 binding index: 0
[TRT] binding to input 0 input_1 dims (b=16 c=3 h=480 w=640) size=58982400
[TRT] binding to output 0 output_cov/Sigmoid binding index: 2
[TRT] binding to output 0 output_cov/Sigmoid dims (b=16 c=1 h=30 w=40) size=76800
[TRT] binding to output 1 output_bbox/BiasAdd binding index: 1
[TRT] binding to output 1 output_bbox/BiasAdd dims (b=16 c=4 h=30 w=40) size=307200
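For context, this is roughly how I load the retrained engine through jetson-inference (a sketch using the Python bindings; the labels file and image names are placeholders, and the blob names are the engine's input binding plus the -o outputs I passed to tao-converter):

from jetson_inference import detectNet
from jetson_utils import loadImage

# placeholder labels file; blob names follow the tao-converter command above
net = detectNet(argv=[
    "--model=yolov4-tiny_elevated/yolov4_tiny_lpdnet_elevated_b16_int8.engine",
    "--labels=labels.txt",
    "--input-blob=Input",
    "--output-cvg=output_cov/Sigmoid",
    "--output-bbox=output_bbox/BiasAdd",
])
img = loadImage("test_image.jpg")
detections = net.Detect(img)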
How do I get correct bindings in the retrained model, so that the engine loads and runs for inference the way the original LPDNet engine does?