tlt-converter fails for a custom-trained YOLOv4 model on Jetson Nano 4GB

On one Nano board, I can generate the trt engine successfully.
Where did you download tlt-converter?

$ ./tlt-converter -k nvidia_tlt -d 3,544,960 -e trt.fp16.engine -t fp16 -p Input,1x3x544x960,8x3x544x960,16x3x544x960 yolov4_resnet18.etlt
[WARNING] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[INFO] ModelImporter.cpp:135: No importer registered for op: BatchedNMSDynamic_TRT. Attempting to import as plugin.
[INFO] builtin_op_importers.cpp:3659: Searching for plugin: BatchedNMSDynamic_TRT, plugin_version: 1, plugin_namespace:
[INFO] builtin_op_importers.cpp:3676: Successfully created plugin: BatchedNMSDynamic_TRT
[INFO] Detected input dimensions from the model: (-1, 3, 544, 960)
[INFO] Model has dynamic shape. Setting up optimization profiles.
[INFO] Using optimization profile min shape: (1, 3, 544, 960) for input: Input
[INFO] Using optimization profile opt shape: (8, 3, 544, 960) for input: Input
[INFO] Using optimization profile max shape: (16, 3, 544, 960) for input: Input
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.

[INFO] Detected 1 inputs and 4 output network tensors.
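
For reference, the -p value in the command above packs the input name and the min/opt/max shapes that the log then reports as the optimization profile. Below is a minimal Python sketch of how that argument could be assembled for different batch sizes. The helper names profile_arg and converter_cmd are hypothetical, and the sketch only builds the argument list; it does not run tlt-converter.

```python
def profile_arg(input_name, chw, batches):
    """Build the -p value: <name>,<min>,<opt>,<max>, each shape as NxCxHxW."""
    if len(batches) != 3 or list(batches) != sorted(batches):
        raise ValueError("batches must be (min, opt, max) in non-decreasing order")
    shapes = ["x".join(str(d) for d in (n, *chw)) for n in batches]
    return ",".join([input_name, *shapes])


def converter_cmd(etlt, key, chw, batches, engine="trt.fp16.engine", dtype="fp16"):
    """Assemble the argv list for tlt-converter (hypothetical helper, nothing is executed)."""
    return [
        "./tlt-converter",
        "-k", key,                               # model key
        "-d", ",".join(str(d) for d in chw),     # input dims C,H,W
        "-e", engine,                            # output engine path
        "-t", dtype,                             # precision
        "-p", profile_arg("Input", chw, batches),
        etlt,
    ]


cmd = converter_cmd("yolov4_resnet18.etlt", "nvidia_tlt", (3, 544, 960), (1, 8, 16))
print(" ".join(cmd))
```

Passing (1, 1, 2) as the batches reproduces the smaller profile used for the yolo_v3 run later in this thread.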

https://docs.nvidia.com/tlt/tlt-user-guide/text/tensorrt.html#tlt-converter-matrix

I found it through this link. I will double-check that it is the correct version!

It seems I used the wrong key: it should be nvidia_tlt, but I used tlt-encode! Silly mistake.

I ran your command on my Nano, and the memory issue seems to appear again…

[ERROR] ../builder/cudnnBuilderUtils.cpp (414) - Cuda Error in findFastestTactic: 98 (invalid device function)
[WARNING] GPU memory allocation error during getBestTactic: BatchedNMS_N
[ERROR] ../builder/cudnnBuilderUtils.cpp (414) - Cuda Error in findFastestTactic: 98 (invalid device function)
[WARNING] GPU memory allocation error during getBestTactic: BatchedNMS_N
[ERROR] Try increasing the workspace size with IBuilderConfig::setMaxWorkspaceSize() if using IBuilder::buildEngineWithConfig, or IBuilder::setMaxWorkspaceSize() if using IBuilder::buildCudaEngine.
[ERROR] ../builder/tacticOptimizer.cpp (1715) - TRTInternal Error in computeCosts: 0 (Could not find any implementation for node BatchedNMS_N.)
[ERROR] ../builder/tacticOptimizer.cpp (1715) - TRTInternal Error in computeCosts: 0 (Could not find any implementation for node BatchedNMS_N.)
[ERROR] Unable to create engine
Segmentation fault (core dumped)

BTW, did you activate the jetson_clocks?

Yes, I run it.
$ sudo nvpmodel -m 0
$ jetson_clocks

Which JetPack version did you install? What is the output of “$ dpkg -l |grep cuda”?

I have activated jetson_clocks and switched to full power mode.
Output of dpkg -l |grep cuda:

kai@kai-jetson:~/workspace/deepstream_tlt_apps/models/yolov4$ dpkg -l |grep cuda
ii  cuda-command-line-tools-10-2               10.2.89-1                                        arm64        CUDA command-line tools
ii  cuda-compiler-10-2                         10.2.89-1                                        arm64        CUDA compiler
ii  cuda-cudart-10-2                           10.2.89-1                                        arm64        CUDA Runtime native Libraries
ii  cuda-cudart-dev-10-2                       10.2.89-1                                        arm64        CUDA Runtime native dev links, headers
ii  cuda-cufft-10-2                            10.2.89-1                                        arm64        CUFFT native runtime libraries
ii  cuda-cufft-dev-10-2                        10.2.89-1                                        arm64        CUFFT native dev links, headers
ii  cuda-cuobjdump-10-2                        10.2.89-1                                        arm64        CUDA cuobjdump
ii  cuda-cupti-10-2                            10.2.89-1                                        arm64        CUDA profiling tools runtime libs.
ii  cuda-cupti-dev-10-2                        10.2.89-1                                        arm64        CUDA profiling tools interface.
ii  cuda-curand-10-2                           10.2.89-1                                        arm64        CURAND native runtime libraries
ii  cuda-curand-dev-10-2                       10.2.89-1                                        arm64        CURAND native dev links, headers
ii  cuda-cusolver-10-2                         10.2.89-1                                        arm64        CUDA solver native runtime libraries
ii  cuda-cusolver-dev-10-2                     10.2.89-1                                        arm64        CUDA solver native dev links, headers
ii  cuda-cusparse-10-2                         10.2.89-1                                        arm64        CUSPARSE native runtime libraries
ii  cuda-cusparse-dev-10-2                     10.2.89-1                                        arm64        CUSPARSE native dev links, headers
ii  cuda-documentation-10-2                    10.2.89-1                                        arm64        CUDA documentation
ii  cuda-driver-dev-10-2                       10.2.89-1                                        arm64        CUDA Driver native dev stub library
ii  cuda-gdb-10-2                              10.2.89-1                                        arm64        CUDA-GDB
ii  cuda-libraries-10-2                        10.2.89-1                                        arm64        CUDA Libraries 10.2 meta-package
ii  cuda-libraries-dev-10-2                    10.2.89-1                                        arm64        CUDA Libraries 10.2 development meta-package
ii  cuda-license-10-2                          10.2.89-1                                        arm64        CUDA licenses
ii  cuda-memcheck-10-2                         10.2.89-1                                        arm64        CUDA-MEMCHECK
ii  cuda-misc-headers-10-2                     10.2.89-1                                        arm64        CUDA miscellaneous headers
ii  cuda-npp-10-2                              10.2.89-1                                        arm64        NPP native runtime libraries
ii  cuda-npp-dev-10-2                          10.2.89-1                                        arm64        NPP native dev links, headers
ii  cuda-nvcc-10-2                             10.2.89-1                                        arm64        CUDA nvcc
ii  cuda-nvdisasm-10-2                         10.2.89-1                                        arm64        CUDA disassembler
ii  cuda-nvgraph-10-2                          10.2.89-1                                        arm64        NVGRAPH native runtime libraries
ii  cuda-nvgraph-dev-10-2                      10.2.89-1                                        arm64        NVGRAPH native dev links, headers
ii  cuda-nvml-dev-10-2                         10.2.89-1                                        arm64        NVML native dev links, headers
ii  cuda-nvprof-10-2                           10.2.89-1                                        arm64        CUDA Profiler tools
ii  cuda-nvprune-10-2                          10.2.89-1                                        arm64        CUDA nvprune
ii  cuda-nvrtc-10-2                            10.2.89-1                                        arm64        NVRTC native runtime libraries
ii  cuda-nvrtc-dev-10-2                        10.2.89-1                                        arm64        NVRTC native dev links, headers
ii  cuda-nvtx-10-2                             10.2.89-1                                        arm64        NVIDIA Tools Extension
ii  cuda-repo-l4t-10-2-local-10.2.89           1.0-1                                            arm64        cuda repository configuration files
ii  cuda-samples-10-2                          10.2.89-1                                        arm64        CUDA example applications
ii  cuda-toolkit-10-2                          10.2.89-1                                        arm64        CUDA Toolkit 10.2 meta-package
ii  cuda-tools-10-2                            10.2.89-1                                        arm64        CUDA Tools meta-package
ii  graphsurgeon-tf                            7.1.3-1+cuda10.2                                 arm64        GraphSurgeon for TensorRT package
ii  libcudnn8                                  8.0.0.180-1+cuda10.2                             arm64        cuDNN runtime libraries
ii  libcudnn8-dev                              8.0.0.180-1+cuda10.2                             arm64        cuDNN development libraries and headers
ii  libcudnn8-doc                              8.0.0.180-1+cuda10.2                             arm64        cuDNN documents and samples
ii  libnvinfer-bin                             7.1.3-1+cuda10.2                                 arm64        TensorRT binaries
ii  libnvinfer-dev                             7.1.3-1+cuda10.2                                 arm64        TensorRT development libraries and headers
ii  libnvinfer-doc                             7.1.3-1+cuda10.2                                 all          TensorRT documentation
ii  libnvinfer-plugin-dev                      7.1.3-1+cuda10.2                                 arm64        TensorRT plugin libraries
ii  libnvinfer-plugin7                         7.1.3-1+cuda10.2                                 arm64        TensorRT plugin libraries
ii  libnvinfer-samples                         7.1.3-1+cuda10.2                                 all          TensorRT samples
ii  libnvinfer7                                7.1.3-1+cuda10.2                                 arm64        TensorRT runtime libraries
ii  libnvonnxparsers-dev                       7.1.3-1+cuda10.2                                 arm64        TensorRT ONNX libraries
ii  libnvonnxparsers7                          7.1.3-1+cuda10.2                                 arm64        TensorRT ONNX libraries
ii  libnvparsers-dev                           7.1.3-1+cuda10.2                                 arm64        TensorRT parsers libraries
ii  libnvparsers7                              7.1.3-1+cuda10.2                                 arm64        TensorRT parsers libraries
ii  nvidia-container-csv-cuda                  10.2.89-1                                        arm64        Jetpack CUDA CSV file
ii  nvidia-container-csv-cudnn                 8.0.0.180-1+cuda10.2                             arm64        Jetpack CUDNN CSV file
ii  nvidia-container-csv-tensorrt              7.1.3.0-1+cuda10.2                               arm64        Jetpack TensorRT CSV file
ii  nvidia-l4t-cuda                            32.5.1-20210219084526                            arm64        NVIDIA CUDA Package
ii  python-libnvinfer                          7.1.3-1+cuda10.2                                 arm64        Python bindings for TensorRT
ii  python-libnvinfer-dev                      7.1.3-1+cuda10.2                                 arm64        Python development package for TensorRT
ii  python3-libnvinfer                         7.1.3-1+cuda10.2                                 arm64        Python 3 bindings for TensorRT
ii  python3-libnvinfer-dev                     7.1.3-1+cuda10.2                                 arm64        Python 3 development package for TensorRT
ii  tensorrt                                   7.1.3.0-1+cuda10.2                               arm64        Meta package of TensorRT
ii  uff-converter-tf                           7.1.3-1+cuda10.2                                 arm64        UFF converter for TensorRT package
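
To compare two boards quickly, the listing above can be reduced to the few versions that matter here. The following is a rough Python sketch, assuming the usual dpkg -l column layout (status, name, version, architecture, description). The function names and the choice of representative packages are illustrative assumptions, not something from the thread.

```python
def parse_dpkg(text):
    """Map package name -> version for installed ('ii') rows of dpkg -l output."""
    versions = {}
    for line in text.splitlines():
        fields = line.split(None, 4)  # status, name, version, arch, description
        if len(fields) >= 4 and fields[0] == "ii":
            versions[fields[1]] = fields[2]
    return versions


def stack_versions(versions):
    """Pick one representative package per component (an illustrative choice)."""
    wanted = {
        "cuda": "cuda-toolkit-10-2",
        "cudnn": "libcudnn8",
        "tensorrt": "tensorrt",
        "l4t": "nvidia-l4t-cuda",
    }
    return {label: versions.get(pkg) for label, pkg in wanted.items()}


sample = """\
ii  cuda-toolkit-10-2    10.2.89-1               arm64  CUDA Toolkit 10.2 meta-package
ii  libcudnn8            8.0.0.180-1+cuda10.2    arm64  cuDNN runtime libraries
ii  nvidia-l4t-cuda      32.5.1-20210219084526   arm64  NVIDIA CUDA Package
ii  tensorrt             7.1.3.0-1+cuda10.2      arm64  Meta package of TensorRT
"""

summary = stack_versions(parse_dpkg(sample))
for label, version in summary.items():
    print(f"{label}: {version}")
```

Running the same summary on both boards and comparing the two dictionaries shows whether the CUDA, cuDNN, and TensorRT packages actually match.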

I ran with the same cuda/cudnn/trt version as you.
Can you generate the trt engine again and check the system status at the same time?
$ sudo tegrastats

When the problem happens, RAM usage peaks at 3289 MB:

RAM 3289/3963MB (lfb 82x4MB) SWAP 599/10173MB (cached 20MB) IRAM 0/252kB(lfb 252kB) CPU [5%@1479,2%@1479,0%@1479,4%@1479] EMC_FREQ 64%@1600 GR3D_FREQ 99%@921 VIC_FREQ 0%@192 APE 25 PLL@42C CPU@45C PMIC@100C GPU@43.5C AO@51C thermal@44C POM_5V_IN 6451/4307 POM_5V_GPU 3225/1299 POM_5V_CPU 489/910
RAM 1607/3963MB (lfb 165x4MB) SWAP 598/10173MB (cached 20MB) IRAM 0/252kB(lfb 252kB) CPU [10%@1479,3%@1479,1%@1479,39%@1479] EMC_FREQ 57%@1600 GR3D_FREQ 0%@921 VIC_FREQ 0%@192 APE 25 PLL@42C CPU@45C PMIC@100C GPU@42.5C AO@51C thermal@43.75C POM_5V_IN 3102/4299 POM_5V_GPU 167/1292 POM_5V_CPU 962/910
RAM 758/3963MB (lfb 192x4MB) SWAP 107/10173MB (cached 10MB) IRAM 0/252kB(lfb 252kB) CPU [33%@1479,0%@1479,1
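
Reading the peak off tegrastats output by eye is error-prone, so here is a small Python sketch that extracts RAM and swap usage from each line and reports the highest RAM sample. It assumes the "RAM used/totalMB" and "SWAP used/totalMB" fields appear as in the lines above; the function names are hypothetical.

```python
import re

# \b keeps the pattern from matching inside the "IRAM" field.
RAM_RE = re.compile(r"\bRAM (\d+)/(\d+)MB")
SWAP_RE = re.compile(r"\bSWAP (\d+)/(\d+)MB")


def parse_mem(line):
    """Return RAM/swap usage in MB for one tegrastats line, or None if no RAM field."""
    ram = RAM_RE.search(line)
    if ram is None:
        return None
    swap = SWAP_RE.search(line)
    return {
        "ram_used": int(ram.group(1)),
        "ram_total": int(ram.group(2)),
        "swap_used": int(swap.group(1)) if swap else None,
    }


def peak_ram(lines):
    """Return the sample with the highest RAM usage, or None if nothing parsed."""
    samples = [m for m in map(parse_mem, lines) if m is not None]
    return max(samples, key=lambda m: m["ram_used"]) if samples else None


# Shortened copies of the three lines above.
sample = [
    "RAM 3289/3963MB (lfb 82x4MB) SWAP 599/10173MB (cached 20MB) IRAM 0/252kB(lfb 252kB)",
    "RAM 1607/3963MB (lfb 165x4MB) SWAP 598/10173MB (cached 20MB) IRAM 0/252kB(lfb 252kB)",
    "RAM 758/3963MB (lfb 192x4MB) SWAP 107/10173MB (cached 10MB) IRAM 0/252kB(lfb 252kB)",
]

peak = peak_ram(sample)
print(f"peak RAM {peak['ram_used']}/{peak['ram_total']} MB, swap {peak['swap_used']} MB")
```

The same functions can be fed a log file captured while the converter runs, one tegrastats line per entry.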

Can you try to generate the yolo_v3 model as well? I can run it successfully on my Nano.

$ ./tlt-converter -k nvidia_tlt -d 3,544,960 -e trt.fp16.engine -t fp16 -p Input,1x3x544x960,1x3x544x960,2x3x544x960 yolov3_resnet18.etlt
[WARNING] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[INFO] ModelImporter.cpp:135: No importer registered for op: BatchedNMSDynamic_TRT. Attempting to import as plugin.
[INFO] builtin_op_importers.cpp:3659: Searching for plugin: BatchedNMSDynamic_TRT, plugin_version: 1, plugin_namespace:
[INFO] builtin_op_importers.cpp:3676: Successfully created plugin: BatchedNMSDynamic_TRT
[INFO] Detected input dimensions from the model: (-1, 3, 544, 960)
[INFO] Model has dynamic shape. Setting up optimization profiles.
[INFO] Using optimization profile min shape: (1, 3, 544, 960) for input: Input
[INFO] Using optimization profile opt shape: (1, 3, 544, 960) for input: Input
[INFO] Using optimization profile max shape: (2, 3, 544, 960) for input: Input
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 4 output network tensors.
$ ls trt.fp16.engine
trt.fp16.engine

Please share your full log.

kai@kai-jetson:~/workspace/deepstream_tlt_apps/models/yolov3$ ./tlt-converter -k nvidia_tlt -d 3,544,960 -e trt.fp16.engine -t fp16 -p Input,1x3x544x960,1x3x544x960,2x3x544x960 yolov3_resnet18.etlt

[WARNING] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[INFO] ModelImporter.cpp:135: No importer registered for op: BatchedNMSDynamic_TRT. Attempting to import as plugin.
[INFO] builtin_op_importers.cpp:3659: Searching for plugin: BatchedNMSDynamic_TRT, plugin_version: 1, plugin_namespace:
[INFO] builtin_op_importers.cpp:3676: Successfully created plugin: BatchedNMSDynamic_TRT
[INFO] Detected input dimensions from the model: (-1, 3, 544, 960)
[INFO] Model has dynamic shape. Setting up optimization profiles.
[INFO] Using optimization profile min shape: (1, 3, 544, 960) for input: Input
[INFO] Using optimization profile opt shape: (1, 3, 544, 960) for input: Input
[INFO] Using optimization profile max shape: (2, 3, 544, 960) for input: Input
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[ERROR] ../builder/cudnnBuilderUtils.cpp (414) - Cuda Error in findFastestTactic: 98 (invalid device function)
[WARNING] GPU memory allocation error during getBestTactic: BatchedNMS_N
[ERROR] ../builder/cudnnBuilderUtils.cpp (414) - Cuda Error in findFastestTactic: 98 (invalid device function)
[WARNING] GPU memory allocation error during getBestTactic: BatchedNMS_N
[ERROR] Try increasing the workspace size with IBuilderConfig::setMaxWorkspaceSize() if using IBuilder::buildEngineWithConfig, or IBuilder::setMaxWorkspaceSize() if using IBuilder::buildCudaEngine.
[ERROR] ../builder/tacticOptimizer.cpp (1715) - TRTInternal Error in computeCosts: 0 (Could not find any implementation for node BatchedNMS_N.)
[ERROR] ../builder/tacticOptimizer.cpp (1715) - TRTInternal Error in computeCosts: 0 (Could not find any implementation for node BatchedNMS_N.)
[ERROR] Unable to create engine
Segmentation fault (core dumped)

If possible, try re-flashing with JetPack 4.4 or 4.5, then run the command above to check whether it still happens.

OK, I will try on another board when I have time.

I tried rebuilding with -t fp32 and it succeeded! This is good progress, but I don’t know why it works. Do you have any thoughts on this?

Which model did you build successfully in fp32 mode? Is it the one from download_models.sh in the NVIDIA-AI-IOT/deepstream_tlt_apps repository on GitHub (master branch), your own model, or both?

Both are successfully built.

Bad news again: when I used the engine file to run inference on the Jetson Nano, the model loaded successfully:

0:00:09.885840553  8742     0x2d171c70 INFO                 nvinfer gstnvinfer.cpp:619:gst_nvinfer_logger:<primary-inference> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:1702> [UID = 1]: deserialized trt engine from :/opt/nvidia/deepstream/deepstream-5.1/samples/models/tlt_pretrained_models/firenet/trt1.engine
INFO: [FullDims Engine Info]: layers num: 5
0   INPUT  kFLOAT Input           3x608x608       min: 1x3x608x608     opt: 8x3x608x608     Max: 16x3x608x608    
1   OUTPUT kINT32 BatchedNMS      1               min: 0               opt: 0               Max: 0               
2   OUTPUT kFLOAT BatchedNMS_1    200x4           min: 0               opt: 0               Max: 0               
3   OUTPUT kFLOAT BatchedNMS_2    200             min: 0               opt: 0               Max: 0               
4   OUTPUT kFLOAT BatchedNMS_3    200             min: 0               opt: 0               Max: 0               

0:00:09.886070038  8742     0x2d171c70 INFO                 nvinfer gstnvinfer.cpp:619:gst_nvinfer_logger:<primary-inference> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:1806> [UID = 1]: Use deserialized engine model: /opt/nvidia/deepstream/deepstream-5.1/samples/models/tlt_pretrained_models/firenet/trt1.engine
0:00:10.090188678  8742     0x2d171c70 INFO                 nvinfer gstnvinfer_impl.cpp:313:notifyLoadModelStatus:<primary-inference> [UID 1]: Load new model:/home/kai/workspace/firenet/ds-fire-perception/../specs/pgie_yolov4.txt sucessfully

However, inference failed with this error message:

ERROR: [TRT]: Assertion failed: status == STATUS_SUCCESS
/home/kai/workspace/TensorRT/plugin/batchedNMSPlugin/batchedNMSPlugin.cpp:246
Aborting...

The config file used in my code is:

[property]
gpu-id=0
net-scale-factor=1.0
offsets=103.939;116.779;123.68
model-color-format=1
labelfile-path=labels.txt
model-engine-file=../models/tlt_pretrained_models/firenet/trt1.engine
tlt-model-key=tlt-encode
infer-dims=3;608;608
maintain-aspect-ratio=1
uff-input-order=0
uff-input-blob-name=Input
batch-size=1
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=0
num-detected-classes=1
interval=0
gie-unique-id=1
is-classifier=0
network-type=1
#no cluster
cluster-mode=3
output-blob-names=BatchedNMS
parse-bbox-func-name=NvDsInferParseCustomBatchedNMSTLT
custom-lib-path=../models/lib/libnvds_infercustomparser_tlt.so

[class-attrs-all]
pre-cluster-threshold=0.3
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=0
detected-min-h=0
detected-max-w=0
detected-max-h=0
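
When cross-checking a config like the one above against the engine info in the log, it can help to pull out the few fields that must agree (engine path, precision, input dims, batch size). The sketch below uses Python's configparser, on the assumption that it is close enough to the key-file format for this particular file. The summarize function is hypothetical, and the 0/1/2 mapping is taken from the comment line in the config itself.

```python
import configparser

# Mapping copied from the "## 0=FP32, 1=INT8, 2=FP16 mode" comment in the config.
NETWORK_MODES = {0: "FP32", 1: "INT8", 2: "FP16"}


def summarize(config_text):
    """Extract the [property] fields that have to match the serialized engine."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    prop = parser["property"]
    return {
        "engine": prop.get("model-engine-file"),
        "precision": NETWORK_MODES[prop.getint("network-mode")],
        "infer_dims": tuple(int(d) for d in prop["infer-dims"].split(";")),
        "batch_size": prop.getint("batch-size"),
    }


# Excerpt of the config above; lines starting with '#' are skipped as comments.
excerpt = """
[property]
model-engine-file=../models/tlt_pretrained_models/firenet/trt1.engine
infer-dims=3;608;608
batch-size=1
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=0
"""

info = summarize(excerpt)
print(info)
```

Here the summary reports FP32 with dims (3, 608, 608) and batch size 1, which can be compared against the 3x608x608 input and the min 1 / max 16 profile printed when the engine was deserialized.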

Did you build the TRT OSS plugin as described in the TLT user guide?
Reference topic: Convert tensorrt engine from version 7 to 8 - #67 by Morganh
