TensorRT 8-bit Quantization questions

Hi, I have recently been studying 8-bit quantization, and I have a few questions:

  1. How are the weights quantized to INT8?
  2. How are the weights_scale values stored in the “pseudocode for the INT8 conv kernel”?

I have already studied the “8-bit inference with TensorRT” ppt, the TensorRT developer guide, and some other resources on the web, but I still cannot find a clear answer. Could someone help answer these questions?

Thanks!

I might be able to help with the first question.

The following process will not only quantize your weights to INT8 but also run your convolutions in INT8, which can give you a nice speedup.

  1. Extend the IInt8EntropyCalibrator class in either Python or C++.
  2. Pass your calibrator to the builder via setInt8Calibrator in C++, or the Python equivalent.
  3. Enable INT8 mode via nvinfer1::IBuilder::setInt8Mode(true) in C++, or the Python equivalent.
  4. Provide calibration data to the builder through your custom class.
  5. Build your TensorRT execution engine as usual.
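To make steps 1 and 4 a bit more concrete, here is a rough, framework-free sketch of the callbacks a calibrator is expected to provide. The method names follow the TensorRT Python calibrator interface as I understand it (get_batch_size, get_batch, read_calibration_cache, write_calibration_cache), so please check them against your TensorRT version. The class name CalibratorSketch and the helper drain are made up for illustration. A real calibrator would subclass the TensorRT calibrator class and return GPU device pointers from get_batch rather than Python lists.

```python
class CalibratorSketch:
    """Duck-typed stand-in for a calibrator (hypothetical, not a TensorRT class)."""

    def __init__(self, batches, batch_size):
        self._batches = iter(batches)   # calibration batches, served one at a time
        self._batch_size = batch_size
        self._cache = None              # calibration table saved after a run

    def get_batch_size(self):
        return self._batch_size

    def get_batch(self, names):
        # A real implementation copies the batch to the GPU and returns one
        # device pointer per input name. Returning None signals "no more data".
        try:
            batch = next(self._batches)
        except StopIteration:
            return None
        return [batch]

    def read_calibration_cache(self):
        # Returning None means there is no cache, so calibration has to run.
        return self._cache

    def write_calibration_cache(self, cache):
        self._cache = bytes(cache)


def drain(calibrator, input_names):
    """Hypothetical driver that pulls batches the way a builder would."""
    served = 0
    while calibrator.get_batch(input_names) is not None:
        served += 1
    return served


calibrator = CalibratorSketch([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], batch_size=2)
print(drain(calibrator, ["data"]))
```

The builder keeps asking for batches until get_batch returns None, so the calibration set size is simply however many batches your class serves.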

Thanks for your answer :)

However, my main question is that I don’t know how TensorRT quantizes the weights. I notice that the INT8 mode or data type is set when creating the engine with “tensorrt.utils.caffe_to_trt_engine”, or when configuring the builder with “builder->setInt8Mode(true)”. What I don’t know is when and how the weights are quantized inside the TensorRT framework, and I couldn’t find any references from NVIDIA.

Hi Zhonggang, please check this ppt, on page 21.

I think TensorRT simply uses the no-saturation quantization method to quantize weights (see page 12 of the “8-bit inference with TensorRT” ppt).
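For reference, here is a minimal sketch of what no-saturation quantization means, assuming a symmetric INT8 range of [-127, 127] and a single scale per weight tensor. This illustrates the idea only and is not TensorRT's actual implementation. The function names quantize_no_saturation and dequantize are made up for the example.

```python
def quantize_no_saturation(weights):
    """Map floats to INT8 so that the largest magnitude lands on +/-127."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        # All-zero tensor: any scale works, so pick 1.0 to avoid dividing by zero.
        return [0] * len(weights), 1.0
    scale = max_abs / 127.0                      # float value of one INT8 step
    quantized = [int(round(w / scale)) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_no_saturation(weights)
print(q, scale)
print(dequantize(q, scale))
```

Because the scale is taken from the largest absolute weight, nothing is clipped. A saturating scheme would instead pick a smaller threshold and clamp everything beyond it to +/-127.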

Wow, thanks very much! The first ppt on SlideShare is exactly what I need, and it really solves my problem. Page 15 of the 8-bit inference ppt mentions that saturated quantization of weights gives no accuracy improvement, but no official document or source code clearly states which quantization method is used for weights. Still, I think the ppt you shared is official evidence of the quantization method for weights. Thank you very much!

Hello @zhonggang @han_qiu,
Could you share the ppt with me? I cannot access the website. Thanks a lot, much appreciated!

Sorry, I viewed this ppt online at my company, and for security reasons I am not able to download it. Maybe @han_qiu could help you.
