Hi, I have recently been studying 8-bit quantization, and I have a few questions:
How are the weights quantized to INT8?
How are the weights_scale values stored in the “pseudocode for the INT8 conv kernel”?
I have already studied the “8-bit Inference with TensorRT” slides and the TensorRT developer guide, as well as some other resources on the web, but I still cannot find a clear answer. Could someone help me with these questions? For the second question, a sketch of my current understanding follows below.
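To make the second question concrete, here is how I currently picture the per-channel scales being applied after the INT8 GEMM. This is only my own understanding, not actual TensorRT code; the only thing taken from the slides is that weights_scale has one entry per output channel K, and every name below is illustrative.

```cpp
#include <cstdint>
#include <vector>

// Sketch (my own understanding, not TensorRT source) of applying the
// per-channel scales to the INT32 GEMM result inside an INT8 conv kernel.
// i32_out: raw INT32 accumulator from the INT8 GEMM, laid out as [N][K]
// input_scale: single scale used to quantize the input activations
// weights_scale: one scale per output channel K, as in the slides
void dequantizeGemmOutput(const std::vector<int32_t>& i32_out,
                          int N, int K,
                          float input_scale,
                          const std::vector<float>& weights_scale,
                          std::vector<float>& f32_out) {
    f32_out.resize(i32_out.size());
    for (int n = 0; n < N; ++n) {
        for (int k = 0; k < K; ++k) {
            // Undo both quantization scales to recover an FP32 value.
            f32_out[n * K + k] = static_cast<float>(i32_out[n * K + k]) /
                                 (input_scale * weights_scale[k]);
        }
    }
}
```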
However, the main question is that I don’t know how TensorRT quantizes the weights. I notice that the INT8 mode or data type is set when creating an engine with “tensorrt.utils.caffe_to_trt_engine”, or by calling “builder->setInt8Mode(true)” on the builder, but I don’t know when and how the weights are actually quantized inside the TensorRT framework, and I couldn’t find any references from NVIDIA.
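For reference, this is roughly how I am turning on INT8 mode with the C++ builder API. setInt8Mode is the call I mentioned above; wiring in a calibrator this way is my reading of the developer guide, so treat it as a sketch rather than a verified recipe.

```cpp
#include "NvInfer.h"

// Sketch of requesting INT8 precision on the (legacy) TensorRT builder.
// The calibrator supplies activation ranges; how the weights themselves
// get quantized during buildCudaEngine() is exactly what I am asking about.
nvinfer1::ICudaEngine* buildInt8Engine(nvinfer1::IBuilder* builder,
                                       nvinfer1::INetworkDefinition* network,
                                       nvinfer1::IInt8Calibrator* calibrator) {
    builder->setMaxBatchSize(1);
    builder->setInt8Mode(true);             // request INT8 kernels
    builder->setInt8Calibrator(calibrator); // calibration for activations
    return builder->buildCudaEngine(*network);
}
```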
Wow, thanks very much! The first slide deck on SlideShare is exactly what I need, and it really solves my problem. Page 15 of the 8-bit inference slides mentions that saturated quantization of the weights gives no accuracy improvement, but no official document or source code clearly states the quantization method for the weights. I think the slides you shared are official evidence of the quantization method for weights. Thank you very much!
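In case it helps anyone else who lands on this thread, here is a minimal sketch of the non-saturating max-abs weight quantization that page 15 implies (the slides only say saturation is not used for the weights; the per-channel layout and every name below are my own illustration, not TensorRT internals):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Minimal sketch of non-saturating (max-abs) symmetric quantization of a
// conv weight tensor, one scale per output channel. Illustrative only.
// w: FP32 weights flattened as [K][C*R*S]; q: resulting INT8 weights;
// weights_scale: per-channel scales such that q ~= w * weights_scale[k].
void quantizeWeights(const std::vector<float>& w, int K, int CRS,
                     std::vector<int8_t>& q,
                     std::vector<float>& weights_scale) {
    q.resize(w.size());
    weights_scale.resize(K);
    for (int k = 0; k < K; ++k) {
        // Map the channel's |max| to 127 (no saturation / clipping).
        float maxAbs = 0.0f;
        for (int i = 0; i < CRS; ++i)
            maxAbs = std::max(maxAbs, std::fabs(w[k * CRS + i]));
        weights_scale[k] = (maxAbs > 0.0f) ? 127.0f / maxAbs : 1.0f;
        for (int i = 0; i < CRS; ++i) {
            float v = std::round(w[k * CRS + i] * weights_scale[k]);
            q[k * CRS + i] = static_cast<int8_t>(v); // |v| <= 127 by construction
        }
    }
}
```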
Sorry, I viewed the slides online at my company, and for security reasons I was not able to download them. Maybe @han_qiu could help you.