TensorRT inference failing with custom INT8-quantized TensorFlow model

Hi Team,
We are facing an issue with a TensorRT INT8-quantized custom model.

Environment

TensorRT Version: 8.2.5.1
GPU Type: Tesla T4
Nvidia Driver Version: 470.256.02
CUDA Version: 11.4
CUDNN Version: 8.2.1
Operating System + Version: Ubuntu 20.04.6
Python Version (if applicable): 3.8.10
TensorFlow Version (if applicable): 2.8.0

Issue

  • Direct conversion from TensorFlow to TensorRT is not possible, so we first converted TensorFlow → ONNX → TensorRT and quantized the model to FP16 and INT8.
  • We can quantize the TensorRT FP32 engine to FP16 directly. For INT8 we created a calibration file and generated the INT8 TensorRT engine, but we are not getting the expected results and accuracy.
  • Final update: the FP32 and FP16 TensorRT engines work properly and give the expected output; the issue is only with INT8.
  • We also tried creating the calibration cache file and the INT8 TensorRT engine with the original dataset used for training, but the predictions are still wrong.
  • We generated the engine file with the command below (a Python sketch of how the calibration cache and INT8 engine are built follows this list):
    ./trtexec --onnx=model.onnx --saveEngine=output_engine_int8.trt --int8 --calib=calibration.cache --verbose
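
For reference, this is a minimal sketch of how we understand the calibration cache and INT8 engine can be produced with the TensorRT Python API instead of trtexec. It assumes a single-input network; the input shape, file names, and the dummy `calibration_batches` are placeholders and must be replaced with batches that go through exactly the same preprocessing as the inference inputs (a preprocessing mismatch is a common cause of poor INT8 accuracy).

```python
# Sketch only: build a calibration cache and INT8 engine with the TensorRT Python API.
# Input shape, file names, and calibration data below are placeholders.
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds preprocessed FP32 batches to TensorRT and caches the computed scales."""

    def __init__(self, batches, cache_file="calibration.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = batches                      # list of np.float32 arrays
        self.index = 0
        self.cache_file = cache_file
        self.device_input = cuda.mem_alloc(batches[0].nbytes)

    def get_batch_size(self):
        return self.batches[0].shape[0]

    def get_batch(self, names):
        if self.index >= len(self.batches):
            return None                             # no more data: calibration done
        cuda.memcpy_htod(self.device_input,
                         np.ascontiguousarray(self.batches[self.index]))
        self.index += 1
        return [int(self.device_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None                             # forces a fresh calibration run

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# Replace with real batches preprocessed exactly like the inference inputs.
calibration_batches = [np.random.rand(8, 3, 224, 224).astype(np.float32)
                       for _ in range(10)]

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = EntropyCalibrator(calibration_batches)
serialized_engine = builder.build_serialized_network(network, config)
with open("output_engine_int8.trt", "wb") as f:
    f.write(serialized_engine)
```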

Can you please help us resolve this issue?

Regards
Karthiga N

Hi @karthickacse55 ,
Here are some potential solutions to address the issues you are facing with TensorRT INT8 quantization for your custom model:

  1. Calibration data in FP32:
  • Ensure that the calibration data is in FP32. TensorRT expects calibration batches in this format even when the engine uses INT8 I/O, which is critical for avoiding precision loss during quantization.
  • If you do use INT8 I/O, confirm that your FP32 calibration data falls within the range [-128.0, 127.0] so it can be converted to INT8 without clipping.
  2. Quantization workflows:
  • Consider the post-training quantization (PTQ) workflow to derive scale factors after the network is trained: activation distributions are measured on representative input data and the scale values are chosen from them.
  • Alternatively, explore quantization-aware training (QAT), which computes scale factors during training and lets the network compensate for quantization effects.
  3. Explicit vs. implicit quantization:
  • Be aware that implicit quantization is deprecated in TensorRT. Instead, use the Quantization Toolkit for explicit quantization, which produces models that carry their own Q/DQ scale information and are compatible with TensorRT’s requirements.
  • Use the PTQ recipe in the Quantization Toolkit with PyTorch to generate a pre-quantized model before exporting it to ONNX (see the sketch after this list).
  4. Leveraging TensorRT’s PTQ capability:
  • TensorRT can generate a calibration cache through implicit quantization, converting activations and weights to INT8 at engine-build time.
  • Recognize that applying QAT or PTQ inside a deep learning framework and exporting to ONNX produces an explicitly quantized model, in which case TensorRT uses the embedded scales instead of running its own calibration.
  5. Utilizing TensorRT’s Quantization Toolkit:
  • The toolkit is a PyTorch library for building QAT models and optimizing them for export to ONNX and deployment in TensorRT.
  • You can also use it for PTQ to calibrate your model before exporting it to ONNX for further optimization.
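
As a rough illustration of the toolkit workflow from item 3, here is a minimal PTQ sketch with NVIDIA’s pytorch-quantization library. It assumes a PyTorch version of your network (a torchvision ResNet-18 and random calibration batches stand in for your model and data), and the exact API may differ between toolkit versions.

```python
# Sketch only: explicit-quantization PTQ with pytorch-quantization, then ONNX export.
# ResNet-18 and the random calibration batches are stand-ins for your own model/data.
import torch
from torchvision.models import resnet18
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

quant_modules.initialize()                      # patch torch.nn layers with quantized variants
model = resnet18().cuda().eval()                # load your trained weights here

calib_batches = [torch.randn(8, 3, 224, 224) for _ in range(4)]  # replace with real FP32 data

# 1) Collect activation statistics: enable calibrators, disable fake quantization.
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer) and m._calibrator is not None:
        m.disable_quant()
        m.enable_calib()

with torch.no_grad():
    for batch in calib_batches:
        model(batch.cuda())

# 2) Compute the scales (amax) and switch back to quantized inference.
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer) and m._calibrator is not None:
        m.load_calib_amax()
        m.enable_quant()
        m.disable_calib()

# 3) Export with Q/DQ nodes so the ONNX model is explicitly quantized.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(model, dummy, "model_int8_qdq.onnx", opset_version=13)
```

With an explicitly quantized ONNX like this, the engine can be built with trtexec using --int8 alone; no --calib cache is needed because the scales are already embedded in the Q/DQ nodes.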

By reviewing the calibration process, the quantization workflow, and the choice between explicit and implicit quantization along these lines, you should be able to recover the expected accuracy from the INT8 TensorRT engine for your custom model.