TLT YOLOv4 (CSPDarknet53) - TensorRT INT8 model gives wrong predictions (0 mAP)

Problem

We have trained and tested TLT YOLOv4 models (CSPDarknet53 and ResNet18 backbones) on a single-class (person) dataset. We exported the .etlt models and generated calibration cache files with yolo_v4 export, and then converted the models to TensorRT .engine files with tlt-converter. All the models convert without errors.

However, the predictions made by YOLOv4 (CSPDarknet53) when converted to TensorRT with INT8 precision are wrong, and the PASCAL VOC 2010 mAP is therefore 0. The same model converted to TensorRT with FP16 or FP32 precision gives correct results.

We have also tested YOLOv4 (ResNet18); it works in all of FP16, FP32, and INT8 precisions. So the problem is specific to YOLOv4 (CSPDarknet53) when it is converted with tlt-converter to a TensorRT INT8 engine.

Has anyone faced a similar issue, or is this a known issue? We would be grateful for any suggestions to resolve it.

Information

• Hardware:

GPUs used for training: NVIDIA Tesla V100
GPUs used for tlt-converter and inference:

  • for FP32 and INT8: GTX 1060 (GPU_ARCHS = 6.1)
  • for FP32 and FP16: Quadro RTX 4000 (GPU_ARCHS = 7.5)

• Network Type: Yolo_v4 (CSPDarknet53)

• Platform and tlt-converter details (we did not use the TLT docker)

Platform: Ubuntu-1804-amd64
CUDA version: 11.1, cuDNN version: 8.0.5
tlt-converter: cuda11.1_cudnn8.0_trt7.2-20210304T191646Z-001.zip
TensorRT version: 7.2.2.3
TensorRT OSS plugins: built the plugins and replaced the original libnvinfer_plugin.so as instructed

• Other Info:

  • INT8 calibration: used 10% of the training data, as instructed here.
  • We are not using DeepStream; we use the TensorRT Python API to run inference (a minimal sketch of our approach is shown right after this list).
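
For reference, this is roughly what our TensorRT Python API inference path looks like. It is a simplified sketch only, not our full script: the engine filename matches the one generated below, preprocessing is replaced by random data, and buffer handling is stripped down.

# Minimal sketch of loading a serialized engine and running one inference with
# the TensorRT 7.2 Python API + PyCUDA. Paths/shapes are from our setup; the
# input data and the simplified buffer handling are placeholders.
import numpy as np
import pycuda.autoinit  # noqa: F401 - creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(path):
    with open(path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

engine = load_engine("yolov4_int8_cal_348batches.engine")
context = engine.create_execution_context()
context.set_binding_shape(0, (1, 3, 416, 416))  # engine was built with a 1x3x416x416 profile

# Allocate host/device buffers for all bindings (1 input + 4 BatchedNMS outputs).
bindings, host_bufs, dev_bufs = [], [], []
for i in range(engine.num_bindings):
    shape = context.get_binding_shape(i)
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = cuda.pagelocked_empty(trt.volume(shape), dtype)
    dev = cuda.mem_alloc(host.nbytes)
    bindings.append(int(dev))
    host_bufs.append(host)
    dev_bufs.append(dev)

# Placeholder input; in the real script this is a preprocessed NCHW image.
host_bufs[0][:] = np.random.rand(host_bufs[0].size).astype(host_bufs[0].dtype)

stream = cuda.Stream()
cuda.memcpy_htod_async(dev_bufs[0], host_bufs[0], stream)
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
for host, dev in zip(host_bufs[1:], dev_bufs[1:]):
    cuda.memcpy_dtoh_async(host, dev, stream)
stream.synchronize()
# host_bufs[1:] now hold the BatchedNMS outputs (num detections, boxes, scores,
# classes - check the binding names for the exact order).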

Commands and logs

tlt yolo_v4 export command:

tlt yolo_v4 export -e /workspace/tlt-experiments/trainings/conf/yolov4_config_1.yml -m /workspace/tlt-experiments/trainings/results_config_1/weights/yolov4_cspdarknet53_epoch_090.tlt --cal_data_file /workspace/tlt-experiments/trainings/results/cal_cspdarknet53_348batches.tensorfile --cal_cache_file /workspace/tlt-experiments/trainings/results/cal_cspdarknet53_348batches.bin --cal_image_dir /workspace/tlt-experiments/data/calibration/images/ -k key --data_type int8 --batches 348 --batch_size 8 -o /workspace/tlt-experiments/trainings/results/yolov4_cspdarknet53_epoch_90_int8_348batches.etlt

Log for tlt yolo_v4 export:

2021-08-27 00:28:44,439 [INFO] root: Registry: ['nvcr.io']
2021-08-27 00:28:44,496 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2021-08-27 00:29:07,227 [INFO] iva.common.export.keras_exporter: Using input nodes: ['Input']
2021-08-27 00:29:07,228 [INFO] iva.common.export.keras_exporter: Using output nodes: ['BatchedNMS']
The ONNX operator number change on the optimization: 771 -> 363
2021-08-27 00:31:44,448 [INFO] keras2onnx: The ONNX operator number change on the optimization: 771 -> 363
30it [00:41,  1.39s/it]
2021-08-27 00:32:29,114 [INFO] iva.common.export.keras_exporter: Calibration takes time especially if number of batches is large.
2021-08-27 00:36:22,850 [INFO] iva.common.export.base_calibrator: Saving calibration cache (size 11340) to /workspace/tlt-experiments/trainings/results/cal_348batches.bin
2021-08-27 00:39:22,369 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

./tlt-converter command

./tlt-converter -k key -d 3,416,416 -o BatchedNMS -c <pathtoTLT>/trainings/yolov4_config_1_90epoch/INT8/cal_cspdarknet53_348batches_3.bin -e <pathtoTLTExperiments>/YOLOv4_experiments/tlt-tensorrrt/models/yolov4_int8_cal_348batches.engine -t int8 -i nchw -p Input,1x3x416x416,1x3x416x416,1x3x416x416 <pathtoTLT>/TLT/trainings/yolov4_config_1_90epoch/INT8/yolov4_cspdarknet53_epoch_90_int8_348batches_3.etlt

./tlt-converter command logs on the GTX 1060 (has INT8 support, no native FP16 support) for YOLOv4 (CSPDarknet53):

[WARNING] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
(the warning above is repeated 21 times in the log)
[INFO] ModelImporter.cpp:135: No importer registered for op: BatchedNMSDynamic_TRT. Attempting to import as plugin.
[INFO] builtin_op_importers.cpp:3771: Searching for plugin: BatchedNMSDynamic_TRT, plugin_version: 1, plugin_namespace: 
[INFO] builtin_op_importers.cpp:3788: Successfully created plugin: BatchedNMSDynamic_TRT
[INFO] Detected input dimensions from the model: (-1, 3, 416, 416)
[INFO] Model has dynamic shape. Setting up optimization profiles.
[INFO] Using optimization profile min shape: (1, 3, 416, 416) for input: Input
[INFO] Using optimization profile opt shape: (1, 3, 416, 416) for input: Input
[INFO] Using optimization profile max shape: (1, 3, 416, 416) for input: Input
[WARNING] Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
[INFO] Reading Calibration Cache for calibrator: EntropyCalibration2
[INFO] Generated calibration scales using calibration cache. Make sure that calibration cache has latest scales.
[INFO] To regenerate calibration cache, please delete the existing one. TensorRT will generate a new calibration cache.
[WARNING] Missing dynamic range for tensor (Unnamed Layer* 332) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing dynamic range for tensor (Unnamed Layer* 440) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 4 output network tensors.

Regarding “Half2 support requested on hardware without native FP16 support, performance will be negatively affected”: I don't understand why we get this warning, as we have set -t int8. But I guess this is not the issue, because we get the same warning for YOLOv4 (ResNet18) on this GPU and it still gives correct predictions.

Just for reference, ./tlt-converter command logs on the GTX 1060 (has INT8 support, no native FP16 support) for YOLOv4 (ResNet18), which works:

[WARNING] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[WARNING] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
(the warning above is repeated 21 times in the log)
[INFO] ModelImporter.cpp:135: No importer registered for op: BatchedNMSDynamic_TRT. Attempting to import as plugin.
[INFO] builtin_op_importers.cpp:3771: Searching for plugin: BatchedNMSDynamic_TRT, plugin_version: 1, plugin_namespace: 
[INFO] builtin_op_importers.cpp:3788: Successfully created plugin: BatchedNMSDynamic_TRT
[INFO] Detected input dimensions from the model: (-1, 3, 416, 416)
[INFO] Model has dynamic shape. Setting up optimization profiles.
[INFO] Using optimization profile min shape: (1, 3, 416, 416) for input: Input
[INFO] Using optimization profile opt shape: (1, 3, 416, 416) for input: Input
[INFO] Using optimization profile max shape: (1, 3, 416, 416) for input: Input
[WARNING] Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
[INFO] Reading Calibration Cache for calibrator: EntropyCalibration2
[INFO] Generated calibration scales using calibration cache. Make sure that calibration cache has latest scales.
[INFO] To regenerate calibration cache, please delete the existing one. TensorRT will generate a new calibration cache.
[WARNING] Missing dynamic range for tensor (Unnamed Layer* 210) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing dynamic range for tensor (Unnamed Layer* 318) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 4 output network tensors.

We can see the “Missing dynamic range for tensor” warning for 2 layers, for both ResNet18 and CSPDarknet53. In this post the advice was to ignore a similar warning, but that was for a different network, so we are not sure how applicable it is to our issue.

To narrow this down, can you generate the TRT INT8 engine inside the TLT docker via the default tlt-converter, and then run inference inside the TLT docker?

Also, please see if the note below helps (see Release Notes - NVIDIA Docs):

  • When generating an int8 engine with tao-converter, please use -s if there is a TensorRT error message saying weights are outside of the fp16 range.

@Morganh I am not using the TAO Toolkit. I was following the TLT YOLOv4 documentation (YOLOv4 — Transfer Learning Toolkit 3.0 documentation) and used tlt-converter (cuda11.1_cudnn8.0_trt7.2-20210304T191646Z-001.zip).

Just to confirm: we need to convert models to INT8 precision for GPUs that do not support FP16 but do support INT8.
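
(As an aside, a quick way we double-check what a given GPU/TensorRT build reports is to query the builder from Python; a minimal sketch, assuming TensorRT and its Python bindings are installed on that machine:)

# Minimal sketch: ask TensorRT whether the local GPU has fast FP16/INT8 support.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with trt.Builder(logger) as builder:
    print("fast FP16 supported:", builder.platform_has_fast_fp16)
    print("fast INT8 supported:", builder.platform_has_fast_int8)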

(A side note:

  • We have tested INT8 conversion on a GPU that supports both INT8 and FP16, and on a GPU that supports INT8 but not FP16, and we got the same result with YOLOv4 (CSPDarknet53): 0 mAP.
  • We also got good results with YOLOv4 (ResNet18) in INT8 mode despite the above warnings. We are only having a problem with YOLOv4 (CSPDarknet53).)

Thanks for the quick response. We will test the default tlt-converter provided with the TLT docker.

We were using tlt-converter, tlt-export, etc., and not tao.
We just realised that TLT has been renamed to TAO.

During the tests we tried the -s option you mentioned above (during both the export and convert steps), but it resulted in many warnings and the resulting model did not fix the issue.

Are there any updates in TAO that are not available in TLT that might solve this issue?

tao-converter is just a renaming of tlt-converter.
A 0 mAP INT8 model mostly results from a problem with cal.bin. How many images did you use to generate cal.bin?

Thanks for getting back quickly.

We used 2784 images to generate cal.bin
(batches = 348 × batch_size = 8).

We have 27,858 training images, so roughly 10% of the training data (randomly sampled) was used for calibration; a sketch of how we sample it is shown below.
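
(For completeness, the calibration subset is produced roughly like this; a sketch only, not our exact script: the training image directory, file extension, and seed are placeholders, while the calibration directory matches the --cal_image_dir used above.)

# Randomly sample ~10% of the training images into the calibration directory.
import random
import shutil
from pathlib import Path

train_dir = Path("/workspace/tlt-experiments/data/training/images")   # placeholder
cal_dir = Path("/workspace/tlt-experiments/data/calibration/images")  # as in --cal_image_dir
cal_dir.mkdir(parents=True, exist_ok=True)

images = sorted(train_dir.glob("*.jpg"))
random.seed(42)  # placeholder seed
subset = random.sample(images, k=int(0.10 * len(images)))

for img in subset:
    shutil.copy(img, cal_dir / img.name)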

Could you please try to add more images?

We tried using 30% of the training data for calibration (8352 images), but we got similar outcomes with YOLOv4 (CSPDarknet53) on both GPUs we tested:

  • Quadro RTX 4000, INT8 precision: PASCAL VOC 2010 mAP@0.5 = 7.86%

  • GTX 1060, INT8 precision: PASCAL VOC 2010 mAP@0.5 = 0%

Based on your previous suggestion, I also tried using the default converter that comes with the TAO docker instead of the stand-alone tao-converter, but got the same results (0 mAP).

>> tao info:
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 1.0
toolkit_version: 3.21.08
published_date: 08/17/2021

However, we could get good results with INT8 precision when we performed QAT-enabled training: on both GPUs we got around 82% mAP on our test set in INT8 precision. Based on the documentation and our previous TensorRT implementations outside TAO, QAT is optional and we should still get reasonable performance by performing calibration alone. So we are wondering what the cause of this is.

As mentioned before, we got good results with the YOLOv4 (ResNet18) backbone in INT8 precision, even with only 10% of the training data for calibration. Also, YOLOv4 (CSPDarknet53) works fine in the other modes (FP16/FP32).
What do you think is the cause of this INT8 issue with the CSPDarknet53 backbone? Would it be beneficial to report this as an issue?

According to the latest comment, for YOLOv4 (CSPDarknet53):

  1. If you trained the model with QAT enabled, the mAP is around 82%. You get this value while running tlt-evaluate against the TRT INT8 engine, right?
  2. If you trained the model without QAT enabled, the mAP is 0?

Hi @Morganh, thanks for getting back to me.

For YOLOv4 (CSPDarknet53):

If you trained the model with QAT enabled, the mAP is around 82%.
If you trained the model without QAT enabled, the mAP is 0?

Yes, but note that I only have this problem with TensorRT INT8 precision. In FP32/FP16, both models get around 82%-84% mAP.

 You get this value while running tlt-evaluate against the TRT INT8 engine, right?

No. I am using a Python script to load and run the model and to do the pre/post-processing. I have verified that I get the same results from the script as from tao evaluate with the .tlt model, but I have not tested tao evaluate with the INT8 .engine file.
We also used the following reference for pre/post-processing:
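
(For context, the post-processing in that script boils down to decoding the four BatchedNMS output buffers. A rough sketch, assuming the standard BatchedNMS_TRT output layout of num_detections, nmsed_boxes, nmsed_scores, nmsed_classes with normalized x1, y1, x2, y2 boxes; the function and variable names are ours.)

# Decode BatchedNMS outputs for a single image into pixel-space detections.
import numpy as np

def decode_batched_nms(num_dets, boxes, scores, classes,
                       img_w, img_h, score_thresh=0.3):
    """Return a list of (class_id, score, x1, y1, x2, y2) in pixel coordinates."""
    detections = []
    n = int(np.asarray(num_dets).reshape(-1)[0])   # detections kept for image 0
    boxes = np.asarray(boxes).reshape(-1, 4)
    scores = np.asarray(scores).reshape(-1)
    classes = np.asarray(classes).reshape(-1)
    for i in range(n):
        if scores[i] < score_thresh:
            continue
        x1, y1, x2, y2 = boxes[i]
        detections.append((int(classes[i]), float(scores[i]),
                           x1 * img_w, y1 * img_h, x2 * img_w, y2 * img_h))
    return detections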

@Morganh I ran tests with tao evaluate + .engine for each engine:
I still got 0 mAP for the model trained without QAT in INT8 precision, but the FP32 engine converted from the same model achieved 83% mAP.

The model trained with QAT achieved 81.2% mAP in INT8 precision with tao evaluate + .engine.

So, the culprit may be training without QAT.

So, the culprit may be training without QAT

  • As mentioned before, we got good results with the YOLOv4 (ResNet18) backbone in INT8 precision, even with only 10% of the training data for calibration (without QAT).
  • Based on the documentation and our previous TensorRT implementations outside TAO (without QAT), QAT is optional and we should still get reasonable performance by performing calibration alone.

Based on our experience, this seems specific to YOLOv4 with the CSPDarknet53 backbone in INT8. Would it be beneficial to report this as an issue?

Could you add the “--force_ptq” flag when you export the .tlt model, and then retry?

--force_ptq: Flag to force post-training quantization for QAT models.

More, I will try to reproduce your result with KITTI public dataset.

Could you add the “--force_ptq” flag when you export the .tlt model, and then retry?
Could you please explain how this may help, and how do we determine for which models we need to apply it and for which we don't? Thanks.

It is for the DLA-specific case.
See DetectNet_v2 — TAO Toolkit 3.22.05 documentation
However, the current version of QAT doesn't natively support DLA int8 deployment on Jetson. To deploy this model on Jetson with DLA int8, use the --force_ptq flag to use TensorRT post-training quantization to generate the calibration cache file.

And https://developer.nvidia.com/blog/improving-int8-accuracy-using-quantization-aware-training-and-tao-toolkit/
To deploy this model with the DLA, you must generate the calibration cache file using PTQ on the QAT-trained .tlt model file. You can do this by setting the force_ptq flag over the command line when running export.

We are not using DLA, and our issue is with the model trained without QAT enabled (not the QAT-enabled one). So, as I understand it, we don't need to use --force_ptq. Is that correct? Please let me know. Thanks a lot for the quick response.