Converting a TAO YOLOv4 model to a DLA engine fails

Please provide the following information when requesting support.

• Xavier NX
• YOLOv4 (darknet19)
• TAO 3.0
• Training spec file (if you have one, please share it here)

I have trained a YOLOv4 (arch = darknet19) model using TAO (https://docs.nvidia.com/tao/tao-toolkit/text/object_detection/yolo_v4.html#exporting-the-model).

I can use "tao-converter-jp46-trt8.0.1.6" to generate an engine that runs in deepstream-app 6.0. However, when I try to build an engine with the DLA flag (-u) set to 0 or 1, the engine build fails.

Error

Module_id 33 Severity 2 : NVMEDIA_DLA 2493
Module_id 33 Severity 2 : Runtime: loadBare failed. error: 0x000004
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1544, GPU 7169 (MiB)
[ERROR] 1: [nvdlaUtils.cpp::deserialize::164] Error Code 1: DLA (NvMediaDlaLoadLoadable : load loadable failed.)
[ERROR] Unable to create engine
Segmentation fault (core dumped)

I have tried the suggestion in this link, "Failed to create DLA engine from .etlt model - #11 by Morganh", and I can build the peoplesegnet_resnet50 engine successfully.

Why is my engine failing?
./tao-converter -o BatchedNMS -d 3,704,1280 -p Input,1x3x704x1280,1x3x704x1280,1x3x704x1280 -u 0 -t fp16 -w 6000000000 -k nnnn12345 yolov4_darknet19.etlt

I'm using JetPack 4.6 (L4T 32.6.1) on a Xavier NX.

To narrow down, can you download an official yolov4 model?

wget https://nvidia.box.com/shared/static/511552h6b1ecw4gd20ptuihoiidz13cs -O models.zip

See deepstream_tao_apps/pgie_yolov4_tao_config.txt at master · NVIDIA-AI-IOT/deepstream_tao_apps · GitHub;
it is a 960x544 model, and tlt-model-key=nvidia_tlt.
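
A conversion command along these lines should build it (the .etlt filename inside models.zip may differ, and the engine name here is only illustrative; adjust -d/-p for the 960x544 input):

./tao-converter -k nvidia_tlt -d 3,544,960 -o BatchedNMS \
    -p Input,1x3x544x960,1x3x544x960,1x3x544x960 \
    -t fp16 -u 0 -w 6000000000 \
    -e yolov4_official_dla.engine \
    yolov4_official.etlt   # placeholder name: use the .etlt extracted from models.zip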

Thanks for your fast response.

I managed to download and convert the official YOLOv4 model and it works fine. With that in mind, I noticed that both the official YOLOv4 model and the peoplesegnet_resnet50 I mentioned above use int8 with a calibration file. So I exported my custom-trained YOLOv4 model as int8, and then I managed to create the engine file.
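
For reference, the int8 conversion I ran looked roughly like this (the engine name and calibration cache path are illustrative):

./tao-converter -k nnnn12345 -o BatchedNMS -d 3,704,1280 \
    -p Input,1x3x704x1280,1x3x704x1280,1x3x704x1280 \
    -t int8 -c export/cal.bin -u 0 -w 6000000000 \
    -e yolov4_darknet19_int8_dla.engine \
    yolov4_darknet19.etlt   # the int8 export that produced the cal.bin above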

I can run it in the deepstream-app, but it doesn't produce bounding boxes, for either the GPU or the DLA version. Is there a common mistake? I can use int8 models trained with DetectNet with no issues.
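
For reference, the int8-related part of my nvinfer config is roughly the following (the key, paths and class count are placeholders for my real values):

[property]
tlt-encoded-model=yolov4_darknet19.etlt
tlt-model-key=nnnn12345
int8-calib-file=export/cal.bin
# network-mode: 0=fp32, 1=int8, 2=fp16
network-mode=1
infer-dims=3;704;1280
uff-input-blob-name=Input
output-blob-names=BatchedNMS
# placeholder for my real class count
num-detected-classes=3
parse-bbox-func-name=NvDsInferParseCustomBatchedNMSTLT
custom-lib-path=post_processor/libnvds_infercustomparser_tao.so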

No, there should not be a mistake.

Can you run GitHub - NVIDIA-AI-IOT/deepstream_tao_apps: Sample apps to demonstrate how to deploy models trained with TAO on DeepStream successfully with the video file inside DeepStream?

I'm currently training a new model and will export it as int8. It's strange that an fp16 model won't build with the DLA settings but an int8 engine will.

I have retrained the model and it works perfectly with fp32. However, I have the same issue with int8: there are no bboxes being displayed. I see this post had the same issue.

As you suggested, I can run the app successfully with the video file in the sample.

I export as below:

# Uncomment to export in INT8 mode (generate calibration cache file).
!tao yolo_v4 export -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/yolov4_darknet19_epoch_200.tlt \
    -o $USER_EXPERIMENT_DIR/export/yolov4_resnet18_epoch_$EPOCH.etlt \
    -e $SPECS_DIR/yolo_v4_retrain_resnet18_kitti.txt \
    -k $KEY \
    --cal_image_dir $USER_EXPERIMENT_DIR/data/testing/image_2 \
    --data_type int8 \
    --batch_size 16 \
    --batches 100 \
    --cal_cache_file $USER_EXPERIMENT_DIR/export/cal.bin \
    --cal_data_file $USER_EXPERIMENT_DIR/export/cal.tensorfile

And receive this message:

The ONNX operator number change on the optimization: 483 → 232
2022-02-04 12:29:02,398 [INFO] keras2onnx: The ONNX operator number change on the optimization: 483 → 232
2022-02-04 12:29:03,855 [INFO] iva.common.export.base_exporter: Generating a tensorfile with random tensor images. This may work well as a profiling tool, however, it may result in inaccurate results at inference. Please generate a tensorfile using the tlt-int8-tensorfile, or provide a custom directory of images for best performance.

I have images in the directory. I will continue testing; any suggestions would be appreciated. I have around 10k images and batches set to 100. Do you think this could be the issue? I will try with 1000.
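
As a quick sanity check on the numbers (assuming I'm right that the exporter draws batch_size x batches images from --cal_image_dir; the host path below is only illustrative, wherever --cal_image_dir resolves on the host):

# how many calibration images are actually available?
ls /home/ubuntu/cv_samples_v1.2.0/yolo_v4/LOCAL_PROJECT_DIR/yolo_v4/data/testing/image_2 | wc -l
# batch_size 16 x batches 100  =  1,600 images drawn
# batch_size 16 x batches 1000 = 16,000 images drawn (more than my ~10k)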

Looks like I had an issue with the below. I have resolved it, and now I don't get the message about "provide a custom directory of images for best performance." I will provide feedback on Monday.

{
    "Mounts": [
        {
            "source": "/home/ubuntu/cv_samples_v1.2.0/yolo_v4/LOCAL_PROJECT_DIR",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/ubuntu/cv_samples_v1.2.0/yolo_v4/specs",
            "destination": "/workspace/tao-experiments/yolo_v4/specs"
        }
    ]
}

This thread also has the same symptoms I’m seeing.

Why do fp16/32 engines work fine and int8 does not?

Can I send you my .etlt file so you can see if you get the same results?

If the log shows "iva.common.export.base_exporter: Generating a tensorfile with random tensor images.", then there is something wrong with the "--cal_image_dir", "--batch_size" and "--batches" arguments.

If after int-8 calibration the accuracy of the int-8 inferences seems to degrade, it could be because there wasn't enough data in the calibration tensorfile used to calibrate the model, or the training data is not entirely representative of your test images and the calibration may be incorrect. Therefore, you may either regenerate the calibration tensorfile with more batches of the training data and recalibrate the model, or calibrate the model on a few images from the test set.

Thanks again for your fast response.

So far:

  • NVIDIA AWS AMI instance
  • NVIDIA YOLOv4 notebook
  • NVIDIA tao-converter fp16/32 works perfectly, with great results
  • NVIDIA tao-converter int8 completes but complains about wrong input values; changing to the recommended values generates the engine
  • The engine/bin loads in the deepstream-app but does not generate bboxes

Conclusion/thought process

  • Training is fine, as the fp16/32 engines are great
  • tao-converter is happy and produces an int8 engine, but DeepStream does not like it
  • The same process with DetectNet works perfectly (fp16/32/int8)

From the threads I have found, there is some confusion on YOLO int8, with none of the threads being closed with a solution.

After many experiments, the YOLOv4 darknet53 models are by far the most accurate, but slower, hence the need for int8 optimization.

I’m happy to provide my model.

I'm not making any progress with int8. I am retraining with "enable_qat: true". I'm hoping this is going to solve my problems.
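
For reference, the flag sits in the training_config block of my training spec (excerpt only; the rest of the spec is unchanged):

training_config {
  enable_qat: true
  # ... other training parameters unchanged ...
}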

Can you share the latest command and log when you run "tao yolo_v4 export"?

Please see below:

2022-02-07 10:31:57,210 [INFO] root: Registry: ['nvcr.io']
2022-02-07 10:31:57,286 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2022-02-07 10:31:57,294 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2022-02-07 10:32:03,884 [INFO] root: Building exporter object.
2022-02-07 10:32:12,825 [INFO] root: Exporting the model.
2022-02-07 10:32:12,825 [INFO] root: Using input nodes: ['Input']
2022-02-07 10:32:12,825 [INFO] root: Using output nodes: ['BatchedNMS']
2022-02-07 10:32:12,825 [INFO] iva.common.export.keras_exporter: Using input nodes: ['Input']
2022-02-07 10:32:12,825 [INFO] iva.common.export.keras_exporter: Using output nodes: ['BatchedNMS']
The ONNX operator number change on the optimization: 483 → 232
2022-02-07 10:32:57,201 [INFO] keras2onnx: The ONNX operator number change on the optimization: 483 → 232
2022-02-07 10:32:58,620 [INFO] iva.common.export.base_exporter: Generating a tensorfile with random tensor images. This may work well as a profiling tool, however, it may result in inaccurate results at inference. Please generate a tensorfile using the tlt-int8-tensorfile, or provide a custom directory of images for best performance.
100%|█████████████████████████████████████| 1000/1000 [2:10:30<00:00, 7.83s/it]
2022-02-07 12:43:28,850 [INFO] iva.common.export.keras_exporter: Calibration takes time especially if number of batches is large.
2022-02-07 12:43:28,850 [INFO] root: Calibration takes time especially if number of batches is large.
2022-02-07 13:15:06,252 [INFO] iva.common.export.base_calibrator: Saving calibration cache (size 7084) to /workspace/tao-experiments/yolo_v4/export/cal.bin
2022-02-07 13:17:43,761 [INFO] root: Export complete.
2022-02-07 13:17:43,761 [INFO] root: {
"param_count": 31.572441,
"size": 121.07246398925781
}
2022-02-07 13:17:46,306 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Can you share the full command?

My images are 1280x720 and I have the output set to

output_width: 960
output_height: 544

Could this be causing issues? I see from this post, Network Image Input Resizing - #6 by Morganh, that the dataloader either pads with zeros or crops to fit the output resolution.
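
For reference, this is the relevant excerpt from the augmentation_config block of my spec (other fields omitted):

augmentation_config {
  output_width: 960
  output_height: 544
  # ... other augmentation parameters unchanged ...
}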

!tao yolo_v4 export -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/yolov4_darknet19_epoch_200.tlt \
    -o $USER_EXPERIMENT_DIR/export/yolov4_resnet18_epoch_$EPOCH.etlt \
    -e $SPECS_DIR/yolo_v4_retrain_resnet18_kitti_seq.txt \
    -k $KEY \
    --cal_image_dir $USER_EXPERIMENT_DIR/data/testing/image_2 \
    --data_type int8 \
    --batch_size 16 \
    --batches 1000 \
    --cal_cache_file $USER_EXPERIMENT_DIR/export/cal.bin \
    --cal_data_file $USER_EXPERIMENT_DIR/export/cal.tensorfile

Can you change to all of your training images?

So instead of the testing folder, try the training folder of images? I will give it a try.

Thanks!

Yes, the cal.bin is generated from the training dataset.

If after int-8 calibration the accuracy of the int-8 inferences seems to degrade, it could be because there wasn't enough data in the calibration tensorfile used to calibrate the model, or the training data is not entirely representative of your test images and the calibration may be incorrect. Therefore, you may either regenerate the calibration tensorfile with more batches of the training data and recalibrate the model, or calibrate the model on a few images from the test set.
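
For example, something along these lines (the training image path is illustrative; size --batches so that batch_size x batches roughly covers your training set):

!tao yolo_v4 export -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/yolov4_darknet19_epoch_200.tlt \
    -o $USER_EXPERIMENT_DIR/export/yolov4_resnet18_epoch_$EPOCH.etlt \
    -e $SPECS_DIR/yolo_v4_retrain_resnet18_kitti_seq.txt \
    -k $KEY \
    --cal_image_dir $USER_EXPERIMENT_DIR/data/training/image_2 \
    --data_type int8 \
    --batch_size 16 \
    --batches 600 \
    --cal_cache_file $USER_EXPERIMENT_DIR/export/cal.bin \
    --cal_data_file $USER_EXPERIMENT_DIR/export/cal.tensorfile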