TLT YOLOv4 (CSPDarknet53) - TensorRT INT8 model gives wrong predictions (0 mAP)

Also, I will try to reproduce your result with the public KITTI dataset.

Could you add the "--force_ptq" flag when you export the .tlt model, and then retry?
Could you please explain how this may help? And how do we find out which models need it and which do not? Thanks.

It is for the DLA-specific case.
See DetectNet_v2 — TAO Toolkit 3.0 documentation
However, the current version of QAT doesn't natively support DLA int8 deployment on Jetson. To deploy this model on Jetson with DLA int8, use the --force_ptq flag to use TensorRT post-training quantization to generate the calibration cache file.

And Improving INT8 Accuracy Using Quantization Aware Training and the NVIDIA TAO Toolkit | NVIDIA Developer Blog
To deploy this model with the DLA, you must generate the calibration cache file using PTQ on the QAT-trained .tlt model file. You can do this by setting the force_ptq flag over the command line when running export.

We are not using DLA. Also, the issue we are seeing is with a model trained without QAT enabled. So, as I understand it, we don't need to use --force_ptq. Is that correct? Please let me know. Thanks a lot for the quick response.

Yes, you can ignore “force_ptq”.

Hi,
I cannot reproduce 0 mAP with the TensorRT INT8 engine. You can try my steps.
My steps:

  • Run a training with the cspdarknet19 backbone (I forgot to set it to 53; I will try that later) on the KITTI dataset.
    Run for only 10 epochs, then take the resulting .tlt model.
  • Generate the .etlt model and the TensorRT INT8 engine

yolo_v4 export -k nvidia_tlt -m epoch_010.tlt -e spec.txt --engine_file 384_1248.engine --data_type int8 --batch_size 8 --batches 10 --cal_cache_file export/cal.bin --cal_data_file export/cal.tensorfile --cal_image_dir /kitti_path/training/image_2 -o 384_1248.etlt

  • Run evaluation

yolo_v4 evaluate -e spec.txt -m 384_1248.engine

I also tried with the cspdarknet53 backbone; there is no issue there either.

Thanks a lot.

Sure. will try and let you know.

Can you please let me know what mAP you got on the test set?

About 60%. I only trained for 10 epochs on the public KITTI dataset.

In this setup, you are using the .engine file generated while running yolo_v4 export, which is specific to the machine that ran the training and export.

I want to use the .etlt file (384_1248.etlt in the above experiment) on another machine, convert it to a .engine file using tao-converter, and then use it for inference. That is where I am facing an issue.
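
For reference, that conversion step might look roughly like the sketch below. It only assembles and prints a tao-converter command, it does not run it. The file names and key are placeholders, the input name "Input" and the 3x416x416 dimensions are taken from the export logs in this thread, and the batch sizes in the optimization profile are assumptions. The flags should be checked against the tao-converter help for the installed version.

```python
import shlex


def build_converter_cmd(etlt, cal_cache, engine, key,
                        input_name="Input", chw=(3, 416, 416),
                        min_bs=1, opt_bs=8, max_bs=16):
    """Assemble a tao-converter argv list for an INT8 engine (sketch only)."""
    c, h, w = chw

    def shape(batch_size):
        return f"{batch_size}x{c}x{h}x{w}"

    # -p expects <input_name>,<min_shape>,<opt_shape>,<max_shape>
    profile = f"{input_name},{shape(min_bs)},{shape(opt_bs)},{shape(max_bs)}"
    return [
        "tao-converter",
        "-k", key,        # same key that was used for export
        "-p", profile,    # optimization profile for the dynamic batch dimension
        "-c", cal_cache,  # calibration cache written during export
        "-t", "int8",     # engine precision
        "-e", engine,     # output engine path on the target machine
        etlt,             # exported model, copied from the training machine
    ]


cmd = build_converter_cmd(
    etlt="model_int8.etlt",   # placeholder file name
    cal_cache="cal.bin",      # placeholder file name
    engine="int8.engine",     # placeholder file name
    key="<key>",              # placeholder key
)
# Print the command for review; run it on the target machine once it looks right.
print(shlex.join(cmd))
```

The calibration cache produced during export has to be copied to the target machine together with the .etlt file, since the INT8 engine is built from both.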

OK, I will use your way to check.

Can you try my way? Is it successful?

thank you !

Yes sure. I will let you know

Hi,
I tried your method (evaluating the int8 .engine file generated during export, using yolo_v4 evaluate), and I got results similar to what I saw in my previous experiments.

when the model was trained without qat:

  • yolo_v4 evaluate + int8 .engine : mAP was 22%
  • yolo_v4 evaluate .tlt : mAP was 84%

when the model was trained with qat enabled:

  • yolo_v4 evaluate + int8 .engine : mAP was 80%
  • yolo_v4 evaluate .tlt : mAP was 82%
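
To make the gap explicit, the four figures above can be tabulated as in the sketch below. It is only arithmetic on the mAP values reported in this post; the dictionary layout and names are illustrative.

```python
# mAP values (percent) reported above, all measured with yolo_v4 evaluate.
results = {
    "without QAT": {"tlt": 84.0, "int8_engine": 22.0},
    "with QAT": {"tlt": 82.0, "int8_engine": 80.0},
}


def int8_drop(entry):
    """Absolute mAP drop (percentage points) from the .tlt model to the INT8 engine."""
    return entry["tlt"] - entry["int8_engine"]


drops = {name: int8_drop(entry) for name, entry in results.items()}
for name, drop in drops.items():
    print(f"{name}: INT8 engine loses {drop:.0f} mAP points")
```

Since the .tlt model trained without QAT still reaches 84%, the large loss appears only after INT8 conversion, which suggests looking at the calibration step rather than at the trained weights.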

So I have this problem at this stage as well. It is hard to understand what the issue is. As I mentioned before, we also achieved good results with ResNet on the same data with INT8 without QAT. Since this works for your dataset on cspdarknet53, it could be an issue specific to our dataset or calibration dataset. Do you have any suggestions or thoughts?


the following are logs for exports

exporting model trained without qat

(tao-venv) ubuntu@ip-172-31-13-148:~$ tao yolo_v4 export -e /workspace/tlt-experiments/trainings/conf/TAO_yolov4_config_1_70_20_10_split_120.yml -m /workspace/tlt-experiments/trainings/results_TAO_config_1_70_20_10_split_120/weights/yolov4_cspdarknet53_epoch_120.tlt --cal_data_file /workspace/tlt-experiments/trainings/results_TAO_config_1_70_20_10_split_120/cal_cspdarkent53_1160batches.tensorfile --cal_cache_file /workspace/tlt-experiments/trainings/results_TAO_config_1_70_20_10_split_120/cal_cspdarknet53_TAO_1160batches.bin --cal_image_dir /workspace/tlt-experiments/data/calibration/images/ -k key --data_type int8 -o /workspace/tlt-experiments/trainings/results_TAO_config_1_70_20_10_split_120/yolov4_cspdarknet53_epoch_120_TAO_int8_1160batches.etlt --gen_ds_config --verbose --batch_size 8 --batches 1160 --engine_file /workspace/tlt-experiments/trainings/results_TAO_config_1_70_20_10_split_120/int8.engine
2021-09-29 13:21:40,376 [INFO] root: Registry: ['nvcr.io']
2021-09-29 13:21:40,458 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2021-09-29 13:22:04,064 [INFO] iva.common.export.keras_exporter: Using input nodes: ['Input']
2021-09-29 13:22:04,065 [INFO] iva.common.export.keras_exporter: Using output nodes: ['BatchedNMS']
2021-09-29 13:23:58,016 [DEBUG] modulus.export._uff: Patching keras BatchNormalization...
2021-09-29 13:23:58,016 [DEBUG] modulus.export._uff: Patching keras Dropout...
2021-09-29 13:23:58,016 [DEBUG] modulus.export._uff: Patching UFF TensorFlow converter apply_fused_padding...
2021-09-29 13:24:03,657 [DEBUG] modulus.export._uff: Unpatching keras BatchNormalization layer...
2021-09-29 13:24:03,657 [DEBUG] modulus.export._uff: Unpatching keras Dropout layer...
The ONNX operator number change on the optimization: 771 -> 363
2021-09-29 13:24:42,002 [INFO] keras2onnx: The ONNX operator number change on the optimization: 771 -> 363
2021-09-29 13:24:42,018 [DEBUG] modulus.export._onnx: Model converted to ONNX, checking model validity with onnx.checker.
2021-09-29 13:24:44,874 [DEBUG] iva.common.export.base_exporter: Data file doesn't exist. Pulling input dimensions from the network.
2021-09-29 13:24:44,875 [DEBUG] iva.common.export.keras_exporter: Input dims: (3, 416, 416)
2021-09-29 13:24:45,012 [DEBUG] iva.common.export.tensorfile: Opening /workspace/tlt-experiments/trainings/results_TAO_config_1_70_20_10_split_120/cal_cspdarkent53_1160batches.tensorfile with mode=w
1160it [13:46,  1.40it/s]
2021-09-29 13:38:31,524 [DEBUG] iva.common.export.tensorfile: Opening /workspace/tlt-experiments/trainings/results_TAO_config_1_70_20_10_split_120/cal_cspdarkent53_1160batches.tensorfile with mode=r
2021-09-29 13:38:31,525 [INFO] iva.common.export.keras_exporter: Calibration takes time especially if number of batches is large.
2021-09-29 13:38:45,253 [DEBUG] iva.common.export.base_calibrator: read_calibration_cache - no-op
2021-09-29 13:49:50,744 [DEBUG] iva.common.export.base_calibrator: read_calibration_cache - no-op
2021-09-29 13:49:50,744 [INFO] iva.common.export.base_calibrator: Saving calibration cache (size 11340) to /workspace/tlt-experiments/trainings/results_TAO_config_1_70_20_10_split_120/cal_cspdarknet53_TAO_1160batches.bin
2021-09-29 13:52:48,456 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

exporting model trained with qat
This export creates a calibration cache, but not a tensorfile. According to this blog, when the model is trained with QAT, the calibration cache is generated directly from information stored in the model rather than by running the normal calibration over the provided calibration data.
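
Either way, the resulting cache can be inspected directly. The sketch below assumes the usual TensorRT calibration cache layout: a header line followed by `tensor_name: hex` lines, where the hex digits are the big-endian IEEE-754 bits of a float32 scale. The sample text and tensor names are made up for illustration and are not taken from the files above.

```python
import struct


def parse_calibration_cache(text):
    """Return (header, {tensor_name: scale}) from calibration cache text.

    Assumes each entry line is '<tensor_name>: <8 hex digits>' and that the
    hex digits encode a big-endian float32 scale.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header, entries = lines[0], lines[1:]
    scales = {}
    for ln in entries:
        # Tensor names may themselves contain ':' (e.g. 'conv/BiasAdd:0'),
        # so split on the last colon only.
        name, _, hex_value = ln.rpartition(":")
        raw = bytes.fromhex(hex_value.strip())
        scales[name.strip()] = struct.unpack("!f", raw)[0]
    return header, scales


# Hypothetical cache contents, for illustration only.
sample = """TRT-7000-EntropyCalibration2
Input: 3c000000
conv1/BiasAdd:0: 3f800000
"""

header, scales = parse_calibration_cache(sample)
print(header)
for name, scale in scales.items():
    print(f"{name}: scale={scale}")
```

Running this on both the cache from the non-QAT export and the cache from the QAT export, and comparing the scales for tensors that appear in both, is one way to look for layers whose ranges differ widely.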

(tao-venv) ubuntu@ip-172-31-13-148:~$ tao yolo_v4 export -e /workspace/tlt-experiments/trainings/conf/TAO_yolov4_config_1_70_20_10_split_120_QAT.yml -m /workspace/tlt-experiments/trainings/results_TAO_config_1_70_20_10_split_120_QAT/weights/yolov4_cspdarknet53_epoch_120.tlt --cal_data_file /workspace/tlt-experiments/trainings/results_TAO_config_1_70_20_10_split_120_QAT/cal_cspdarkent53_1160batches.tensorfile --cal_cache_file /workspace/tlt-experiments/trainings/results_TAO_config_1_70_20_10_split_120_QAT/cal_cspdarknet53_QAT_1160batches.bin --cal_image_dir /workspace/tlt-experiments/data/calibration/images/ -k <key> --data_type int8 -o /workspace/tlt-experiments/trainings/results_TAO_config_1_70_20_10_split_120_QAT/yolov4_cspdarknet53_epoch_120_QAT_int8_1160batches.etlt --gen_ds_config --verbose --batch_size 8 --batches 1160 --engine_file /workspace/tlt-experiments/trainings/results_TAO_config_1_70_20_10_split_120_QAT/int8.engine

2021-09-29 13:57:21,748 [INFO] root: Registry: ['nvcr.io']
2021-09-29 13:57:21,832 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2021-09-29 13:57:46,072 [INFO] iva.common.export.keras_exporter: Using input nodes: ['Input']
2021-09-29 13:57:46,072 [INFO] iva.common.export.keras_exporter: Using output nodes: ['BatchedNMS']
2021-09-29 14:14:20,777 [DEBUG] modulus.export._uff: Patching keras BatchNormalization...
2021-09-29 14:14:20,778 [DEBUG] modulus.export._uff: Patching keras Dropout...
2021-09-29 14:14:20,779 [DEBUG] modulus.export._uff: Patching UFF TensorFlow converter apply_fused_padding...
2021-09-29 14:14:26,762 [DEBUG] modulus.export._uff: Unpatching keras BatchNormalization layer...
2021-09-29 14:14:26,763 [DEBUG] modulus.export._uff: Unpatching keras Dropout layer...
The ONNX operator number change on the optimization: 771 -> 363
2021-09-29 14:15:07,105 [INFO] keras2onnx: The ONNX operator number change on the optimization: 771 -> 363
2021-09-29 14:15:07,122 [DEBUG] modulus.export._onnx: Model converted to ONNX, checking model validity with onnx.checker.
2021-09-29 14:15:10,003 [DEBUG] iva.common.export.base_exporter: Data file doesn't exist. Pulling input dimensions from the network.
2021-09-29 14:15:10,003 [DEBUG] iva.common.export.keras_exporter: Input dims: (3, 416, 416)
Tensors in scale dictionary but not in network: {'yolo_spp_pool_1/MaxPool:0', 'b4_final_trans/convolution:0', 'yolo_spp_pool_3/MaxPool:0', 'yolo_spp_pool_2/MaxPool:0', 'b3_final_trans/convolution:0'}
2021-09-29 14:20:12,660 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

@Morganh I am going to follow the same steps as you and check the results I get with the KITTI dataset. Can you please let me know:

  • Is there a way to download only the KITTI train/valid/test partitions you used in the experiment, without downloading the complete dataset from here (The KITTI Vision Benchmark Suite)? Also, how did you partition your data (train, valid, test)?

  • Can you please share the spec file you used during training and evaluation?

  • What training command did you use (so I can verify mine)?

thanks.

Please download the notebook as described in TAO Toolkit Quick Start Guide — TAO Toolkit 3.0 documentation and follow its steps for splitting the dataset.
187790_spec.txt (2.5 KB)
Training command: yolo_v4 train -e 187790_spec.txt -r result_cspdarknet53 -k nvidia_tlt