TAO converter - INT8 engine generated with YOLOV4(CSPDarknet53) gives wrong predictions(0 mAP) for models trained with fish-eye datasets

This is a follow-up to the original question we posted in this forum. Unfortunately, that post was closed without the problem being solved on our end. I am creating an updated post here with the new information we gathered through more testing. The main addition is that we have narrowed the problem down: it only occurs for models trained with a fisheye dataset.

Problem

We have trained and tested TLT YOLOv4 (CSPDarknet53 and resnet18) models with TAO on a dataset with a single person class. We are training 2 sets of models:

  • for fish-eye datasets
  • for standard camera datasets

We exported .etlt models, generated calibration cache files, and then converted the models to TensorRT .engine files with tao-converter. All the models convert without error.
Everything works fine with the models trained on standard-camera images. We see the issue below only with models trained and calibrated with the fisheye dataset when training does not use QAT. Models trained with QAT on the fisheye dataset also work fine in INT8.

With models trained without QAT on the fisheye dataset, the predictions made by YOLOv4 (CSPDarknet53) after conversion to TensorRT with INT8 precision are wrong, and the PASCAL 2010 mAP is therefore 0. The same model converted to TensorRT with fp16 or fp32 precision gives correct results.
We have tested the following with the same fish-eye dataset.

  • TAO YOLOv4 (resnet18): works in fp16, fp32 and int8 precisions.
  • Using this open-source project to convert a YOLOv4 darknet model to ONNX and then to TensorRT, followed by INT8 calibration and conversion: this works fine too.

So at this point it seems less likely that this is an issue with TensorRT itself, as both methods above give good results. We assume it could be an issue with the TAO YOLOv4 (CSPDarknet53) model, which either cannot handle fisheye data or has a problem with INT8 calibration on fisheye data. This is only a guess, reached after trying many options to narrow down the issue.

In addition, we have validated our training, export and conversion steps by training a model on the public KITTI dataset and checking the INT8 engine accuracy, as instructed by the NVIDIA moderator.

Information

• Hardware:

GPU used for Training: NVIDIA Tesla V100 GPUs
GPUs used for tao-converter and inference:

  • for fp32 and int8 - GTX 1060 (GPU_ARCHS = 6.1)
  • for fp32 and fp16 - Quadro RTX4000 (GPU_ARCHS = 7.5 )

• Network Type: Yolo_v4 (CSPDarknet53)

• Platform and tao-converter details

We have tested with and without docker when trying to narrow down the issue and we have achieved the same results.

Platform: Ubuntu-1804-amd64

  • Without docker
    CUDA version: 11.1, cuDNN version: 8.1
    tao-converter: cuda111-cudnn80-trt72
    TensorRT version: 7.2.3.4
    TensorRT OSS Plugins: 06.21 - Built plugins and have replaced original libnvinfer_plugin.so as instructed

  • With docker
    option 1 - tensorrt:21.09-py3 + https://developer.nvidia.com/tao-converter-80
    option 2 - default converter that comes with the TAO docker

Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 1.0
toolkit_version: 3.21.08
published_date: 08/17/2021

• Other Info:

  • The calibration dataset consists of 9286 randomly selected images from the training set (model export settings: batches = 1160 and batch_size = 8).
  • We are not using DeepStream; we use the TensorRT Python API for inference.
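
As a quick sanity check on these numbers, here is a minimal Python sketch that compares batches x batch_size against the number of calibration images available. The helper name calibration_coverage is made up for illustration and is not part of TAO.

```python
def calibration_coverage(num_images: int, batch_size: int, batches: int) -> dict:
    """Compare the images requested for calibration with the images available."""
    requested = batch_size * batches
    return {
        "requested": requested,                          # images the export will consume
        "unused": max(num_images - requested, 0),        # images left over
        "shortfall": max(requested - num_images, 0),     # images missing, if any
    }


# Values from this post: 9286 images, batch_size 8, 1160 batches.
coverage = calibration_coverage(num_images=9286, batch_size=8, batches=1160)
print(coverage)
```

With the settings above, 9280 of the 9286 images are consumed and there is no shortfall, so the export is not asking for more images than the calibration directory holds.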

This kind of issue (no bboxes) results from cal.bin. For the KITTI dataset, I attached one cal.bin; see Inference YOLO_v4 int8 mode doesn't show any bounding box - #30 by Morganh.
You can use it for testing.

We don't have the no-bboxes issue. We get bboxes, but they are wrong and their confidence is low. This is not the case for the KITTI dataset, where we get the same accuracy as the one you mentioned.

What I meant to express here was that we followed the instructions and got the same results for the KITTI dataset as provided by the moderator, so we know our steps are correct.

OK, thanks for the clarification. So, according to your experiments:
YOLOv4 (resnet18): runs inference well in int8 precision on the fisheye dataset
YOLOv4 (CSPDarknet53): runs inference well in int8 precision on standard camera datasets
YOLOv4 (CSPDarknet53): runs inference well in int8 precision on the fisheye dataset when trained with QAT
YOLOv4 (CSPDarknet53): cannot run inference well in int8 precision on the fisheye dataset when trained without QAT (PTQ)

Could you please:

  1. Upload the cal.bin generated by YOLOv4 (resnet18) without QAT.
  2. Upload the cal.bin generated by YOLOv4 (CSPDarknet53) without QAT.
  3. Share an inference result that shows “predictions made by YOLOv4(CSPDarknet53) when converted to TensorRT with INT8 precision are wrong”.

Also, is there a public fisheye dataset that can reproduce the issue?

Hi,
For the failed case (non-QAT model with 0 mAP), could you please share the log from running "tao yolo_v4 export xxx"?

Hi Morganh, please find the full log for tao yolo_v4 export command

command

tao yolo_v4 export \
-e \
/workspace/tlt-experiments/trainings/conf/TAO_yolov4_config_1_70_20_10_split_120.yml \
-m \
/workspace/tlt-experiments/trainings/$results_dir/weights/yolov4_cspdarknet53_epoch_120.tlt \
--cal_data_file \
/workspace/tlt-experiments/trainings/$results_dir/cal_1160batches.tensorfile \
--cal_cache_file \
/workspace/tlt-experiments/trainings/$results_dir/cal_1160batches.bin \
--cal_image_dir \
/workspace/tlt-experiments/data/calibration/images/ \
-k <key> \
--data_type int8 \
-o \
/workspace/tlt-experiments/trainings/$results_dir/model.etlt \
--gen_ds_config \
--verbose \
--batch_size 8 \
--batches 1160 \
--engine_file \
/workspace/tlt-experiments/trainings/$results_dir/int8.engine

log

2021-11-29 02:59:13,056 [INFO] root: Registry: ['nvcr.io']
2021-11-29 02:59:13,140 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2021-11-29 02:59:36,809 [INFO] iva.common.export.keras_exporter: Using input nodes: ['Input']
2021-11-29 02:59:36,809 [INFO] iva.common.export.keras_exporter: Using output nodes: ['BatchedNMS']
2021-11-29 03:01:31,873 [DEBUG] modulus.export._uff: Patching keras BatchNormalization...
2021-11-29 03:01:31,874 [DEBUG] modulus.export._uff: Patching keras Dropout...
2021-11-29 03:01:31,874 [DEBUG] modulus.export._uff: Patching UFF TensorFlow converter apply_fused_padding...
2021-11-29 03:01:37,733 [DEBUG] modulus.export._uff: Unpatching keras BatchNormalization layer...
2021-11-29 03:01:37,734 [DEBUG] modulus.export._uff: Unpatching keras Dropout layer...
The ONNX operator number change on the optimization: 771 -> 363
2021-11-29 03:02:16,861 [INFO] keras2onnx: The ONNX operator number change on the optimization: 771 -> 363
2021-11-29 03:02:16,877 [DEBUG] modulus.export._onnx: Model converted to ONNX, checking model validity with onnx.checker.
2021-11-29 03:02:19,868 [DEBUG] iva.common.export.base_exporter: Data file doesn't exist. Pulling input dimensions from the network.
2021-11-29 03:02:19,869 [DEBUG] iva.common.export.keras_exporter: Input dims: (3, 416, 416)
2021-11-29 03:02:19,974 [DEBUG] iva.common.export.tensorfile: Opening /workspace/tlt-experiments/trainings/results_TAO_config_1_70_20_10_split_120/cal_1160batches.tensorfile with mode=w
1160it [13:28,  1.44it/s]
2021-11-29 03:15:48,179 [DEBUG] iva.common.export.tensorfile: Opening /workspace/tlt-experiments/trainings/results_TAO_config_1_70_20_10_split_120/cal_1160batches.tensorfile with mode=r
2021-11-29 03:15:48,180 [INFO] iva.common.export.keras_exporter: Calibration takes time especially if number of batches is large.
2021-11-29 03:16:02,014 [DEBUG] iva.common.export.base_calibrator: read_calibration_cache - no-op
2021-11-29 03:27:10,959 [DEBUG] iva.common.export.base_calibrator: read_calibration_cache - no-op
2021-11-29 03:27:10,959 [INFO] iva.common.export.base_calibrator: Saving calibration cache (size 11340) to /workspace/tlt-experiments/trainings/results_TAO_config_1_70_20_10_split_120/cal_1160batches.bin
2021-11-29 03:30:08,834 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Please find the calibration files below:

Resnet34 (without QAT)
cal_resnet34_1160batches.bin (8.4 KB)
CSPDarknet53 (without QAT)
cal_cspdarknet53_1160batches.bin (11.1 KB)
CSPDarknet53 (with QAT)
cal_cspdarknet53_QAT_1160batches.bin (4.2 KB)

Data:

It's not possible to share images due to company policy. However, I can share detections for a few images, though I am not sure whether that will be useful without the images. Please let me know for what purpose you would like to see the results.

We don't know of, and have not used, a public fisheye dataset for object detection.

Actually, I have never reproduced 0 mAP for int8. You can see some similar topics, such as TLT YOLOv4 (CSPDakrnet53) - TensorRT INT8 model gives wrong predictions (0 mAP) - Intelligent Video Analytics / TAO Toolkit - NVIDIA Developer Forums and Deepstream infrence gives no detection - Intelligent Video Analytics / TAO Toolkit - NVIDIA Developer Forums. That's why I am asking for the detailed steps and some sample images.

Currently, we find that if the end user gets the log line “iva.common.export.base_exporter: Generating a tensorfile with random tensor images.” (as in Deepstream infrence gives no detection - Intelligent Video Analytics / TAO Toolkit - NVIDIA Developer Forums), then something is wrong with the “cal_image_dir”, “batch_size” or “batches” arguments of the command.
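
A small sketch of that check in Python, assuming the export log has been captured as text. The function name is hypothetical; the marker string is the one quoted above, and the two sample lines are taken from the export log posted earlier in this thread.

```python
RANDOM_TENSOR_MARKER = "Generating a tensorfile with random tensor images"


def used_random_calibration_data(log_text: str) -> bool:
    """Return True if the export log shows calibration ran on random tensors."""
    return any(RANDOM_TENSOR_MARKER in line for line in log_text.splitlines())


# Two lines from the non-QAT export log posted in this thread.
sample_log = "\n".join([
    "2021-11-29 03:02:19,868 [DEBUG] iva.common.export.base_exporter: "
    "Data file doesn't exist. Pulling input dimensions from the network.",
    "2021-11-29 03:15:48,180 [INFO] iva.common.export.keras_exporter: "
    "Calibration takes time especially if number of batches is large.",
])

print(used_random_calibration_data(sample_log))  # False for these two lines
```

If this returns True for a full export log, the calibration cache was built from random tensors rather than from the images in cal_image_dir.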

I’m still looking into your log line “Data file doesn’t exist. Pulling input dimensions from the network.”

Also, can you run inference well with the non-QAT model and the with-QAT cal.bin?

Thanks Morganh,

We don't get the log line you mentioned (iva.common.export.base_exporter: Generating a tensorfile with random tensor images).

About the log line “Data file doesn’t exist. Pulling input dimensions from the network.”: it also appears during QAT model export. I have not saved the log for resnet34 (non-QAT); I will check again and let you know when I get some time. For your information, the log of tao yolo_v4 export for the QAT model follows. Based on this log, it seems the QAT model doesn't really use the calibration images to generate the scales, and instead pulls them from the network/.tlt model. Is that correct? According to this blog, when the model is trained with QAT, the calibration file is generated directly by extracting information from the model rather than by running the normal calibration on the provided calibration data.

log for yolov4(CSPDarknet53) QAT model for tao yolo_v4 export

2021-11-29 03:42:16,359 [INFO] root: Registry: ['nvcr.io']
2021-11-29 03:42:16,443 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2021-11-29 03:42:42,810 [INFO] iva.common.export.keras_exporter: Using input nodes: ['Input']
2021-11-29 03:42:42,811 [INFO] iva.common.export.keras_exporter: Using output nodes: ['BatchedNMS']
2021-11-29 03:59:35,180 [DEBUG] modulus.export._uff: Patching keras BatchNormalization...
2021-11-29 03:59:35,181 [DEBUG] modulus.export._uff: Patching keras Dropout...
2021-11-29 03:59:35,182 [DEBUG] modulus.export._uff: Patching UFF TensorFlow converter apply_fused_padding...
2021-11-29 03:59:41,347 [DEBUG] modulus.export._uff: Unpatching keras BatchNormalization layer...
2021-11-29 03:59:41,347 [DEBUG] modulus.export._uff: Unpatching keras Dropout layer...
The ONNX operator number change on the optimization: 771 -> 363
2021-11-29 04:00:21,846 [INFO] keras2onnx: The ONNX operator number change on the optimization: 771 -> 363
2021-11-29 04:00:21,863 [DEBUG] modulus.export._onnx: Model converted to ONNX, checking model validity with onnx.checker.
2021-11-29 04:00:24,825 [DEBUG] iva.common.export.base_exporter: Data file doesn't exist. Pulling input dimensions from the network.
2021-11-29 04:00:24,825 [DEBUG] iva.common.export.keras_exporter: Input dims: (3, 416, 416)
Tensors in scale dictionary but not in network: {'yolo_spp_pool_3/MaxPool:0', 'b3_final_trans/convolution:0', 'b4_final_trans/convolution:0', 'yolo_spp_pool_1/MaxPool:0', 'yolo_spp_pool_2/MaxPool:0'}
2021-11-29 04:05:28,194 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

“Can you run inference well with the non-QAT model and the with-QAT cal.bin?” What does this mean, and what do you expect to see? Please let me know. I will try this when I get some time and report back.

Sample images: do you need them for inspection (very few images with predictions) or for training? We could provide a very small number of sample images, but it is still not possible to post them in a public forum.

about 0 mAP:
We get

  • 0 mAP when the model is converted with cuda111-cudnn80-trt73(TRT7.2) converter (on GTX 1060)
  • 9.43% mAP when the model is converted with cuda113-cudnn80-trt80 converter (TRT8.0) (on GTX 1060).
  • when we generate a .engine file during the tao yolo_v4 export stage and evaluate with yolo_v4 evaluate, we get an accuracy of around 22-30% (on AWS EC2 instances with NVIDIA Tesla V100 GPUs). I explained this during one of our previous discussions on the initial post. Referencing it in case it is useful for debugging:
    TLT YOLOv4 (CSPDakrnet53) - TensorRT INT8 model gives wrong predictions (0 mAP) - #33 by kgksl

Yes. When exporting a model trained with QAT enabled, the tensor scale factors to calibrate the activations are peeled out of the model and serialized to a TensorRT readable cache file defined by the cal_cache_file argument.

Since inference works well with the with-QAT cal.bin, I want to know whether it can also be used for inference with the non-QAT model.

Yes, you do not need to post publicly. If possible, please share some images with me by sending me a private forum message, or share them in a Google Drive folder and give me access.
My purpose is to reproduce the issue. I suggest adding the .tlt and .yml files as well.

So, you are not always getting 0 mAP. Please change the threshold when you run inference or evaluation. Refer to Difference in mAP between tlt evaluate and tlt inference - #15 by Morganh

About confidence thresholds:
We use our own script for inference (TensorRT) with confidence_threshold=0.001, and we use the same model trained with the same configs. I don't think there is an issue with confidence thresholds, as we got similar results for other models (for example, for the resnet34 model mentioned above, the tao evaluate result was 80%, and the result from our TensorRT inference and evaluation script was 79.9%).

thanks. I need to check if I can share the trained model. Will let you know.

Hi, was it possible to identify any abnormalities in the above calibration cache files? Did they give any indication of a possible issue?

For YOLOv4(CSPDarknet53) under PTQ mode, it does not make sense that standard camera datasets can work but fisheye dataset cannot work.

You previously uploaded the cal.bin from the fisheye dataset.
Could you please upload the cal.bin from the standard camera datasets as well?
We can check whether there is a difference.

cal.bin (11.1 KB)
Sure. Please find above the calibration cache file for YOLOv4 (CSPDarknet53) trained on the standard camera dataset without QAT.

Could you please share your email address, so that I can share the trained models and sample images with you (via Google Drive)?

I have already sent you a private forum message with my email address.

Comparing the two cal.bin files:

  • fisheye
  • standard camera datasets

I see nothing abnormal in the fisheye cal.bin.
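
For anyone who wants to repeat this comparison, here is a rough Python sketch. It assumes the usual text layout of a TensorRT calibration cache: one header line, then one "tensor_name: hex" line per tensor, where the hex string is a big-endian float32 scale. That layout is my understanding and not something confirmed in this thread. The function names, the header string and the two tiny caches below are made up for illustration; they are not the contents of the attached files.

```python
import struct


def parse_calibration_cache(text: str) -> dict:
    """Map tensor name -> float scale for every 'name: hex' line."""
    scales = {}
    for line in text.splitlines():
        # Split on the last ": " so names such as "conv/convolution:0" stay whole.
        name, sep, hex_value = line.strip().rpartition(": ")
        if not sep:
            continue  # header line, no "name: value" pair
        try:
            scales[name] = struct.unpack("!f", bytes.fromhex(hex_value))[0]
        except (ValueError, struct.error):
            continue  # value is not an 8-digit hex float, skip it
    return scales


def scale_ratios(cache_a: dict, cache_b: dict) -> dict:
    """Ratio of scales (a / b) for tensors present in both caches."""
    shared = sorted(set(cache_a) & set(cache_b))
    return {n: cache_a[n] / cache_b[n] for n in shared if cache_b[n] != 0.0}


# Hypothetical cache contents, for illustration only.
fisheye_text = "TRT-7203-EntropyCalibration2\nInput: 3f800000\nconv1/convolution:0: 40000000\n"
standard_text = "TRT-7203-EntropyCalibration2\nInput: 3f800000\nconv1/convolution:0: 3f800000\n"

ratios = scale_ratios(parse_calibration_cache(fisheye_text),
                      parse_calibration_cache(standard_text))

# Print the tensors whose scales differ most between the two caches first.
for name, ratio in sorted(ratios.items(), key=lambda kv: abs(kv[1] - 1.0), reverse=True):
    print(f"{name}: fisheye/standard scale ratio = {ratio:.3f}")
```

With the real files, each cal.bin would be read as text and passed through the same two functions. Tensors whose ratio is far from 1 are the ones worth a closer look.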

Do you have the log from when you generated cal.bin for the standard camera datasets?
You can use a similar command to generate cal.bin for the fisheye dataset.
You can also try “--batch_size 1” and a larger “--batches”. Please use the exact training images to generate cal.bin as much as possible.

Do you have the log from when you generated cal.bin for the standard camera datasets?
You can use a similar command to generate cal.bin for the fisheye dataset.

Yes, we are using the same command; please refer to this and the previous discussion threads for the exact command. I don't believe there are any issues, as we use the same command and you have already checked it a few times.

You can also try “--batch_size 1” and a larger “--batches”. Please use the exact training images to generate cal.bin as much as possible.

Could you please explain the idea behind this? We can try “--batch_size 1” and a larger “--batches”, but then it won't be the same as the command used for the standard-camera export, which works. And what is the idea behind using a batch_size of 1? In what kind of situations is it preferred over a larger value?

“Please use the exact training images to generate cal.bin as much as possible.” Could you please explain this? It is not clear to me. Thanks.

The more data provided during calibration, the closer int8 inferences are to fp32 inferences.

If the accuracy of the int8 inferences seems to degrade after int8 calibration, it could be because there wasn’t enough data in the calibration tensorfile used to calibrate the model, or because the training data is not entirely representative of your test images, so the calibration may be incorrect. Therefore, you may either regenerate the calibration tensorfile with more batches of the training data and recalibrate the model, or calibrate the model on a few images from the test set.
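
To illustrate why unrepresentative calibration data hurts, here is a toy Python sketch. It uses a simple symmetric max-abs scale, which is an assumption made to keep the example short and is not the entropy calibration that TensorRT runs. All activation values are invented.

```python
def int8_scale(calibration_values):
    """Symmetric scale so the largest calibration magnitude maps to 127."""
    return max(abs(v) for v in calibration_values) / 127.0


def quantize_dequantize(values, scale):
    """Round to int8 levels, clamp to [-127, 127], then map back to float."""
    out = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        out.append(q * scale)
    return out


def mean_abs_error(values, scale):
    """Average absolute difference between values and their int8 round trip."""
    restored = quantize_dequantize(values, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Invented activations seen at test time.
test_activations = [0.1, -0.5, 2.0, 6.0, -8.0]

# Calibration data that covers the test range vs. data that is too narrow.
representative_scale = int8_scale([0.2, -7.5, 8.0])
narrow_scale = int8_scale([0.1, -0.9, 1.0])

err_representative = mean_abs_error(test_activations, representative_scale)
err_narrow = mean_abs_error(test_activations, narrow_scale)
print(f"representative calibration error: {err_representative:.4f}")
print(f"narrow calibration error:         {err_narrow:.4f}")
```

With the narrow calibration set, every activation larger than 1.0 in magnitude is clipped, so the error is much larger. The same kind of mismatch can appear when the calibration images do not cover the activations the network produces on the test images.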

As for changing batch_size or batches, I just want to narrow down the issue.