TensorRT inference failing with custom INT8-quantized TensorFlow model

Hi Team,
We are facing an issue with a TensorRT INT8-quantized custom model.

Environment

TensorRT Version: 8.2.5.1
GPU Type: Tesla T4
Nvidia Driver Version: 470.256.02
CUDA Version: 11.4
CUDNN Version: 8.2.1
Operating System + Version: Ubuntu 20.04.6
Python Version (if applicable): 3.8.10
TensorFlow Version (if applicable): 2.8.0

Issue

  • Direct conversion from TensorFlow to TensorRT is not possible, so we first converted TensorFlow → ONNX → TensorRT and quantized the model to FP16 and INT8.
  • We can quantize the TensorRT FP32 engine to FP16 directly. For INT8 we created a calibration file and generated the INT8 TensorRT engine, but we are not getting the expected results and accuracy.
  • Final update: the FP32 and FP16 TensorRT engines work properly and give the expected output; the issue is only with INT8.
  • We also tried creating the calibration cache file and the INT8 TensorRT engine with the original dataset used for training, but the predictions are still wrong.
  • We generated the engine file with the command below (a Python sketch of how the calibration cache and INT8 engine are built follows this list):
    ./trtexec --onnx=model.onnx --saveEngine=output_engine_int8.trt --int8 --calib=calibration.cache --verbose
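
For reference, this is a minimal sketch of how we understand the calibration cache and INT8 engine can be produced with the TensorRT Python API instead of trtexec. It assumes a single-input network; the input shape, file names, and the dummy `calibration_batches` are placeholders and must be replaced with batches that go through exactly the same preprocessing as the inference inputs (a preprocessing mismatch is a common cause of poor INT8 accuracy).

```python
# Sketch only: build a calibration cache and INT8 engine with the TensorRT Python API.
# Input shape, file names, and calibration data below are placeholders.
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds preprocessed FP32 batches to TensorRT and caches the computed scales."""

    def __init__(self, batches, cache_file="calibration.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = batches                      # list of np.float32 arrays
        self.index = 0
        self.cache_file = cache_file
        self.device_input = cuda.mem_alloc(batches[0].nbytes)

    def get_batch_size(self):
        return self.batches[0].shape[0]

    def get_batch(self, names):
        if self.index >= len(self.batches):
            return None                             # no more data: calibration done
        cuda.memcpy_htod(self.device_input,
                         np.ascontiguousarray(self.batches[self.index]))
        self.index += 1
        return [int(self.device_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None                             # forces a fresh calibration run

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# Replace with real batches preprocessed exactly like the inference inputs.
calibration_batches = [np.random.rand(8, 3, 224, 224).astype(np.float32)
                       for _ in range(10)]

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = EntropyCalibrator(calibration_batches)
serialized_engine = builder.build_serialized_network(network, config)
with open("output_engine_int8.trt", "wb") as f:
    f.write(serialized_engine)
```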

Can you please help us resolve this issue?

Regards
Karthiga N

Hi @karthickacse55 ,
Here are some potential solutions to address the issues you are facing with TensorRT INT8 quantization for your custom model:

  1. Calibration data in FP32:
  • Ensure that the calibration data is in FP32. TensorRT expects calibration batches in this format even when the engine uses INT8 I/O, which is critical for avoiding precision loss during quantization.
  • If you do use INT8 I/O, confirm that your FP32 calibration data falls within the range [-128.0, 127.0] so it can be converted to INT8 without clipping.
  2. Quantization workflows:
  • Consider the post-training quantization (PTQ) workflow to derive scale factors after the network is trained: activation distributions are measured on representative input data and the scale values are chosen from them.
  • Alternatively, explore quantization-aware training (QAT), which computes scale factors during training and lets the network compensate for quantization effects.
  3. Explicit vs. implicit quantization:
  • Be aware that implicit quantization is deprecated in TensorRT. Instead, use the Quantization Toolkit for explicit quantization, which produces models that carry their own Q/DQ scale information and are compatible with TensorRT’s requirements.
  • Use the PTQ recipe in the Quantization Toolkit with PyTorch to generate a pre-quantized model before exporting it to ONNX (see the sketch after this list).
  4. Leveraging TensorRT’s PTQ capability:
  • TensorRT can generate a calibration cache through implicit quantization, converting activations and weights to INT8 at engine-build time.
  • Recognize that applying QAT or PTQ inside a deep learning framework and exporting to ONNX produces an explicitly quantized model, in which case TensorRT uses the embedded scales instead of running its own calibration.
  5. Utilizing TensorRT’s Quantization Toolkit:
  • The toolkit is a PyTorch library for building QAT models and optimizing them for export to ONNX and deployment in TensorRT.
  • You can also use it for PTQ to calibrate your model before exporting it to ONNX for further optimization.
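
As a rough illustration of the toolkit workflow from item 3, here is a minimal PTQ sketch with NVIDIA’s pytorch-quantization library. It assumes a PyTorch version of your network (a torchvision ResNet-18 and random calibration batches stand in for your model and data), and the exact API may differ between toolkit versions.

```python
# Sketch only: explicit-quantization PTQ with pytorch-quantization, then ONNX export.
# ResNet-18 and the random calibration batches are stand-ins for your own model/data.
import torch
from torchvision.models import resnet18
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

quant_modules.initialize()                      # patch torch.nn layers with quantized variants
model = resnet18().cuda().eval()                # load your trained weights here

calib_batches = [torch.randn(8, 3, 224, 224) for _ in range(4)]  # replace with real FP32 data

# 1) Collect activation statistics: enable calibrators, disable fake quantization.
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer) and m._calibrator is not None:
        m.disable_quant()
        m.enable_calib()

with torch.no_grad():
    for batch in calib_batches:
        model(batch.cuda())

# 2) Compute the scales (amax) and switch back to quantized inference.
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer) and m._calibrator is not None:
        m.load_calib_amax()
        m.enable_quant()
        m.disable_calib()

# 3) Export with Q/DQ nodes so the ONNX model is explicitly quantized.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(model, dummy, "model_int8_qdq.onnx", opset_version=13)
```

With an explicitly quantized ONNX like this, the engine can be built with trtexec using --int8 alone; no --calib cache is needed because the scales are already embedded in the Q/DQ nodes.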

By reviewing the calibration process, the quantization workflow, and the choice between explicit and implicit quantization along these lines, you should be able to recover the expected accuracy from the INT8 TensorRT engine for your custom model.