tlt-converter for custom-trained YoloV4 model fails on Jetson Nano 4G

I trained a YoloV4 model and am trying to convert it to an engine file on my Jetson Nano. The full command is:

tlt-converter -k tlt-encode \
              -d 3,608,608 \
              -o BatchedNMS \
              -e trt1.engine \
              -m 2 \
              -t fp16 \
              -i nchw \
              -p Input,1x3x608x608,1x3x608x608,2x3x608x608 \
              -w 1610612736 \
              yolov4_cspdarknet19_epoch_055.etlt

The error message is:

[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[ERROR] ../builder/cudnnBuilderUtils.cpp (414) - Cuda Error in findFastestTactic: 98 (invalid device function)
[WARNING] GPU memory allocation error during getBestTactic: BatchedNMS_N
[ERROR] ../builder/cudnnBuilderUtils.cpp (414) - Cuda Error in findFastestTactic: 98 (invalid device function)
[WARNING] GPU memory allocation error during getBestTactic: BatchedNMS_N
[ERROR] Try increasing the workspace size with IBuilderConfig::setMaxWorkspaceSize() if using IBuilder::buildEngineWithConfig, or IBuilder::setMaxWorkspaceSize() if using IBuilder::buildCudaEngine.
[ERROR] ../builder/tacticOptimizer.cpp (1715) - TRTInternal Error in computeCosts: 0 (Could not find any implementation for node BatchedNMS_N.)
[ERROR] ../builder/tacticOptimizer.cpp (1715) - TRTInternal Error in computeCosts: 0 (Could not find any implementation for node BatchedNMS_N.)
[ERROR] Unable to create engine

It seems to be out of memory.
-m maximum TensorRT engine batch size (default 16). If meet with out-of-memory issue, please decrease the batch size accordingly.
Please try to decrease -m to 1 and retry.

Unfortunately, it still fails, so I suspect there may be other issues. How can I track them down?

tlt-converter -k tlt-encode \
              -d 3,608,608 \
              -o BatchedNMS \
              -e trt1.engine \
              -m 1 \
              -t fp16 \
              -i nchw \
              -p Input,1x3x608x608,1x3x608x608,1x3x608x608 \
              -w 1000000000 \
              yolov4_cspdarknet19_epoch_055.etlt

Please remove -p Input,1x3x608x608,1x3x608x608,1x3x608x608.
Refer to the yolo_v4 jupyter notebook.

When I removed the -p line, the error message was as follows:

[INFO] Detected input dimensions from the model: (-1, 3, 608, 608)
[ERROR] Model has dynamic shape but no optimization profile specified.

For TLT 3.0-py3 version, “-p” is needed for yolo_v4. See YOLOv4 — Transfer Learning Toolkit 3.0 documentation
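For reference, the -p values used in this thread follow the pattern input name, then minimum, optimal and maximum shape, each written as batch x channels x height x width. A minimal sketch of assembling that string is shown here; the helper name `make_profile` is hypothetical and not part of tlt-converter, and the input name `Input` is simply the one used in the commands above.

```shell
# Hypothetical helper: build the -p argument as
#   <input_name>,<min_shape>,<opt_shape>,<max_shape>
# $1 = input name, $2 = CxHxW, $3/$4/$5 = min/opt/max batch size
make_profile() {
  printf '%s,%sx%s,%sx%s,%sx%s\n' "$1" "$3" "$2" "$4" "$2" "$5" "$2"
}

# Batch size 1 everywhere, matching the "-m 1" attempt above.
profile="$(make_profile Input 3x608x608 1 1 1)"
echo "$profile"   # Input,1x3x608x608,1x3x608x608,1x3x608x608
```

In the commands posted above, the batch size in the maximum shape matches the value passed to -m (2 in the first attempt, 1 in the second).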

To narrow down, can you run tlt-converter successfully in the machine where you run the training?

Yes, I successfully ran tlt-converter from the Jupyter notebook on a server with -p added. However, the engine file created there does not work on the Jetson, which is why I am trying to run tlt-converter on the Jetson Nano itself, but without success so far.

Here is the error message from running my app:

NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:1798> [UID = 1]: deserialize backend context from engine from file :/opt/nvidia/deepstream/deepstream-5.1/samples/models/tlt_pretrained_models/firenet/trt.engine failed, try rebuild
0:00:07.345111513 27444     0x39be5670 INFO                 nvinfer gstnvinfer.cpp:619:gst_nvinfer_logger:<primary-inference> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1716> [UID = 1]: Trying to create engine from model files
ERROR: failed to build network since there is no model file matched.
ERROR: failed to build network.

So it is still an OOM issue. Can you check whether the following works?

  • restart Nano
  • increase “-w”
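When picking a value for -w, it may help to work in MiB and convert, assuming -w takes a size in bytes (the values in this thread, such as 1610612736, are consistent with that). The sketch below is only an illustration; the helper name `mib_to_bytes` is made up for this example.

```shell
# Hypothetical helper: convert MiB to a byte count for -w.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

mib_to_bytes 1536   # 1610612736, the value used in the first command
mib_to_bytes 1024   # 1073741824

# Check how much memory is currently available before choosing a value.
# MemAvailable is reported by the Linux kernel in kB.
if [ -r /proc/meminfo ]; then
  grep MemAvailable /proc/meminfo
fi
```

The Nano shares its 4 GB between CPU and GPU, so a workspace larger than the memory actually available is unlikely to help.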

I used -w 2130000000; the memory usage is shown below.

BTW, I have booted the Jetson into text mode; the initial memory usage is only 0.4GB.

I am running it now and will let you know the result in a few minutes.

Seems we made a little progress …

[ERROR] /home/jenkins/workspace/TensorRT/helpers/rel-7.1/L1_Nightly_Internal/build/source/rtSafe/resources.h (460) - Cuda Error in loadKernel: 702 (the launch timed out and was terminated)
[ERROR] ../rtSafe/safeRuntime.cpp (32) - Cuda Error in free: 702 (the launch timed out and was terminated)
terminate called after throwing an instance of 'nvinfer1::CudaError'
  what():  std::exception

Here is one experiment to try. I suggest training a yolo_v4 model with a smaller input_size, for example 128x128.
You can train for just 1 epoch, then export the tlt model to an etlt model. Next, copy the etlt model to the Nano and run tlt-converter again.

Thanks for your advice. I will give it a try!

Doesn’t seem to work :(

Is there a way to run tlt-converter in the server for the jetson nano? Jetson Nano is too limited in resources.

If you run inference on the Nano, it is recommended to generate the TRT engine on the Nano as well, to avoid TensorRT version mismatch errors.
Can you try more experiments for yolo_v4?
Please download the models; see deepstream_tlt_apps/download_models.sh at master · NVIDIA-AI-IOT/deepstream_tlt_apps · GitHub

# For Faster-RCNN / YoloV3 / YoloV4 /SSD / DSSD / RetinaNet/ UNET/:
# wget https://nvidia.box.com/shared/static/i1cer4s3ox4v8svbfkuj5js8yqm3yazo.zip -O models.zip

Try running tlt-converter against these models, which were trained by NVIDIA.
Their key is nvidia_tlt. The input size is 960x544.
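On the point about TRT mismatch: one way to see whether the server and the Nano differ is to compare the installed TensorRT package versions on each. The sketch below assumes a Debian-based system where TensorRT was installed as libnvinfer packages; the helper name `trt_version` and the two sample version strings are illustrative only, not values taken from this thread.

```shell
# Hypothetical helper: strip the Debian revision from a package version,
# e.g. "7.1.3-1+cuda10.2" -> "7.1.3".
trt_version() {
  echo "${1%%-*}"
}

# List the installed TensorRT runtime packages where dpkg is available.
if command -v dpkg >/dev/null 2>&1; then
  dpkg -l | grep -i nvinfer || echo "no libnvinfer packages found"
fi

# Compare the version noted on the server with the one on the Nano.
server_ver="$(trt_version "7.2.1-1+cuda11.1")"   # illustrative server value
nano_ver="$(trt_version "7.1.3-1+cuda10.2")"     # illustrative Nano value
if [ "$server_ver" != "$nano_ver" ]; then
  echo "TensorRT versions differ: $server_ver vs $nano_ver"
fi
```

Even when the versions match, a serialized engine is generally tied to the GPU it was built on, so building on the target device remains the safer route.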

Ok let me play with them

I downloaded the models and then tried tlt-converter on yolov4_resnet18.etlt. The command I used is:

tlt-converter -k tlt-encode  \
                    -d 3,384,1248 \
                    -o BatchedNMS \
                    -e trt.fp16.engine \
                    -t fp16 \
                    -i nchw \
                    -m 8 \
                    yolov4_resnet18.etlt

The error message is as follows:

[libprotobuf ERROR google/protobuf/text_format.cc:298] Error parsing text-format onnx2trt_onnx.ModelProto: 1:1: Invalid control characters encountered in text.
[libprotobuf ERROR google/protobuf/text_format.cc:298] Error parsing text-format onnx2trt_onnx.ModelProto: 1:3: Interpreting non ascii codepoint 200.
[libprotobuf ERROR google/protobuf/text_format.cc:298] Error parsing text-format onnx2trt_onnx.ModelProto: 1:3: Message type "onnx2trt_onnx.ModelProto" has no field named "u".
Failed to parse ONNX model from file/tmp/fileQlEezP
[INFO] Model has no dynamic shape.
[ERROR] Network must have at least one output
[ERROR] Network validation failed.
[ERROR] Unable to create engine
Segmentation fault (core dumped)

Please set to -d 3,544,960

Nope, the result is the same

Can you add “-p” option?
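Putting the suggestions from the replies together (key nvidia_tlt, -d 3,544,960, and a -p profile), a command for the downloaded model might look like the sketch below. It is untested here and rests on assumptions: the input node is taken to be `Input` as in the earlier yolo_v4 commands, and the batch size is kept at 1 to limit memory use on the Nano.

```shell
KEY="nvidia_tlt"               # key stated for the NVIDIA-trained models
MODEL="yolov4_resnet18.etlt"
SHAPE="3x544x960"              # CxHxW for a 960x544 input

# Assumption: input node name is "Input"; batch size 1 for min/opt/max.
CMD="tlt-converter -k $KEY -d 3,544,960 -o BatchedNMS -e trt.fp16.engine -m 1 -t fp16 -i nchw -p Input,1x$SHAPE,1x$SHAPE,1x$SHAPE $MODEL"
echo "$CMD"

# Run only when the converter and the model file are actually present.
if command -v tlt-converter >/dev/null 2>&1 && [ -f "$MODEL" ]; then
  $CMD
fi
```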