Model quantization to FP16

Description

I’m trying to quantize a model to reduce inference time. The model is in FP32, with its layer weights within the FP32 range, but when I quantize it with TensorRT/ONNX the output quality degrades severely. What is the best way to quantize the model while maintaining output quality?

Environment

TensorRT Version: 10.0+
GPU Type: RTX 3090
Nvidia Driver Version:
CUDA Version: 12.4
CUDNN Version:
Operating System + Version: Ubuntu 22.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

I am using the DEFOM-stereo model for this quantization.

Hi @sarthakgarg0303 ,
To minimize the impact of quantization on output quality, follow these best practices:

  1. Calibrate your model: Calibration collects statistics about your model’s activations and weights, which the quantization algorithm uses to determine the optimal scaling factors. Make sure to calibrate with a representative dataset (see the calibration sketch after this list).
  2. Use a suitable quantization scheme: There are several quantization schemes available, including:
  • Post-training quantization (PTQ): This is the simplest and fastest method, but it may not always produce the best results.
  • Quantization-aware training (QAT): This method trains the model with simulated quantization noise, which helps it adapt to the reduced precision (see the QAT sketch after this list).
  • Knowledge distillation: This method trains a smaller, quantized model to mimic the behavior of the original, full-precision model (see the distillation sketch after this list).
  3. Choose the right precision: Experiment with different precisions (e.g., FP16, INT8) to find the best trade-off between accuracy and performance (see the builder-flag sketch after this list).
  4. Use a quantization-friendly activation function: Some activation functions, such as ReLU, are more robust to quantization than others.
  5. Monitor and adjust: Monitor the output quality and adjust the quantization parameters as needed; a sketch for comparing FP32 and quantized outputs follows the DEFOM-stereo recommendations below.
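
As a concrete starting point for the calibration step, here is a minimal sketch of an INT8 entropy calibrator for the TensorRT Python API. The input names ("left", "right"), the batch layout, and the use of pycuda for the device copies are assumptions for illustration; substitute your model’s actual input names and preprocessing.

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt


class StereoCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds representative stereo pairs to TensorRT's INT8 calibration pass."""

    def __init__(self, batches, cache_file="defom_calib.cache"):
        super().__init__()
        self.batches = iter(batches)      # iterable of dicts: {input_name: np.ndarray}
        self.cache_file = cache_file
        self.device_buffers = {}

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None                   # no more data -> calibration finishes
        pointers = []
        for name in names:                # e.g. ["left", "right"] -- check your ONNX inputs
            array = np.ascontiguousarray(batch[name], dtype=np.float32)
            if name not in self.device_buffers:
                self.device_buffers[name] = cuda.mem_alloc(array.nbytes)
            cuda.memcpy_htod(self.device_buffers[name], array)
            pointers.append(int(self.device_buffers[name]))
        return pointers

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()           # reuse a previous calibration run if present
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```

The calibrator is attached through config.int8_calibrator when building the engine (see the precision sketch further down).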
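For QAT, the sketch below illustrates the general idea with PyTorch’s built-in eager-mode flow on a toy network (TinyNet, the random data, and the placeholder loss are stand-ins, not DEFOM-stereo code). For a TensorRT deployment you would typically insert Q/DQ fake-quantization with NVIDIA’s pytorch-quantization or Model Optimizer toolkits and then export to ONNX, but the recipe is the same: fine-tune briefly with fake quantization enabled.

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq


class TinyNet(nn.Module):
    """Toy stand-in for the real model, just to show the QAT mechanics."""

    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()        # marks where tensors enter the quantized region
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.relu = nn.ReLU()
        self.dequant = tq.DeQuantStub()    # marks where tensors leave the quantized region

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))


model = TinyNet().train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")   # int8 weights + activations
tq.prepare_qat(model, inplace=True)                     # insert fake-quant observers

# Short fine-tuning loop; random tensors and an arbitrary loss stand in for real
# training data and the real stereo loss.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(10):
    x = torch.randn(2, 3, 64, 64)
    loss = model(x).abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model.eval()
quantized_model = tq.convert(model)        # fold observers into real int8 modules
```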
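For knowledge distillation, a single training step could look like the sketch below. The call signature teacher(left, right) returning a disparity map, the L1 losses, and the alpha weighting are illustrative assumptions, not DEFOM-stereo’s actual training code.

```python
import torch
import torch.nn.functional as F


def distillation_step(teacher, student, left, right, gt_disp, optimizer, alpha=0.5):
    """One distillation update: a smaller / quantization-friendly student mimics the
    FP32 teacher's disparity while still being supervised by ground truth.
    (Assumed call signature: model(left, right) -> disparity map.)"""
    teacher.eval()
    with torch.no_grad():
        teacher_disp = teacher(left, right)        # full-precision reference disparity
    student_disp = student(left, right)
    # Blend imitation of the teacher with the ordinary ground-truth loss.
    loss = alpha * F.l1_loss(student_disp, teacher_disp) \
        + (1 - alpha) * F.l1_loss(student_disp, gt_disp)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```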
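To experiment with precisions, the TensorRT builder flags are the main knob. A hedged sketch, assuming you have exported the model to ONNX (the path defom_stereo.onnx is a placeholder):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(0)            # networks are explicit-batch in TensorRT 10
parser = trt.OnnxParser(network, logger)

with open("defom_stereo.onnx", "rb") as f:     # placeholder path for your ONNX export
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)          # try FP16 first: usually near-lossless
# For INT8, also supply calibration (or export an ONNX with Q/DQ nodes from QAT):
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = StereoCalibrator(...)   # from the calibration sketch above

serialized_engine = builder.build_serialized_network(network, config)
with open("defom_stereo_fp16.engine", "wb") as f:
    f.write(serialized_engine)
```

A good progression is FP32 → FP16 → INT8: FP16 alone often recovers most of the speedup with little quality loss, and INT8 is only worth enabling once calibration or QAT is in place.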

Specific Recommendations for DEFOM-stereo Model

Since you’re using the DEFOM-stereo model, which is a stereo matching algorithm, I recommend the following:

  1. Use QAT: QAT helps the model adapt to the quantized precision, which is particularly important for stereo matching, where the disparity output relies on precise matching computations.
  2. Calibrate using a stereo dataset: Calibrate with representative stereo pairs (ideally from the same domain you run inference on) so that the quantization algorithm captures the activation statistics accurately; the calibrator sketch above can be fed such pairs.
  3. Experiment with different precisions: Try FP16 first, then INT8, and compare the accuracy/performance trade-off for each.
  4. Monitor the output quality: Compare the quantized model’s disparity maps against the FP32 outputs on a validation set and adjust the quantization parameters as needed (see the comparison sketch below).
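
To quantify the degradation, compare the quantized model’s disparity maps against the FP32 reference with standard stereo metrics. The sketch below uses random arrays purely as a runnable demo; in practice, feed both models the same validation pairs.

```python
import numpy as np


def end_point_error(disp_ref, disp_test):
    """Mean absolute disparity difference in pixels (EPE)."""
    return float(np.mean(np.abs(disp_ref - disp_test)))


def bad_pixel_ratio(disp_ref, disp_test, threshold=1.0):
    """Fraction of pixels whose disparity error exceeds `threshold` pixels."""
    return float(np.mean(np.abs(disp_ref - disp_test) > threshold))


# Demo with synthetic maps; replace `ref` with the FP32 model's disparity and
# `test` with the quantized engine's disparity on the same stereo pair.
ref = np.random.rand(480, 640).astype(np.float32) * 192
test = ref + np.random.randn(480, 640).astype(np.float32) * 0.5
print(f"EPE: {end_point_error(ref, test):.3f} px, "
      f">1px errors: {bad_pixel_ratio(ref, test):.2%}")
```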

Additional Tips

  1. Use the TensorRT tooling: TensorRT ships tools such as trtexec (for quickly building engines with different precision flags and profiling them) and works with the TensorRT Model Optimizer library for quantization, which can help you optimize your model for inference.
  2. Use ONNX Runtime: ONNX Runtime provides its own quantization tooling and APIs for optimizing and deploying ONNX models (see the sketch below).
  3. Consider a more advanced quantization toolkit: Toolkits such as NVIDIA’s Model Optimizer or the TensorFlow Model Optimization Toolkit implement more advanced quantization algorithms that can produce better results than basic post-training quantization.
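
If you go through ONNX Runtime, its post-training static quantization looks roughly like the sketch below. The input names "left"/"right", the shapes, and the file paths are placeholders; a real calibration reader should yield preprocessed stereo pairs instead of random data.

```python
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static


class StereoDataReader(CalibrationDataReader):
    """Supplies calibration batches to ONNX Runtime's static quantizer."""

    def __init__(self, num_batches=16):
        # Placeholder data: in practice, load and preprocess real stereo pairs here.
        self.data = iter(
            {"left": np.random.rand(1, 3, 480, 640).astype(np.float32),
             "right": np.random.rand(1, 3, 480, 640).astype(np.float32)}
            for _ in range(num_batches)
        )

    def get_next(self):
        return next(self.data, None)       # None tells ORT that calibration is finished


quantize_static(
    model_input="defom_stereo.onnx",       # FP32 ONNX export (placeholder path)
    model_output="defom_stereo_int8.onnx",
    calibration_data_reader=StereoDataReader(),
    weight_type=QuantType.QInt8,
)
```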