Hi, I have recently been studying 8-bit quantization, and I have a few questions:
How are the weights quantized to INT8?
How are the weights_scale values stored in the “pseudocode for the INT8 conv kernel”?
I have already studied the “8-bit Inference with TensorRT” slides and the TensorRT developer guide, as well as some other resources on the web, but I still cannot find a clear answer. Could someone help me with these questions? For the second question, a sketch of my current understanding follows below.
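To make the second question concrete, here is how I currently picture the per-channel scales being applied after the INT8 GEMM. This is only my own understanding, not actual TensorRT code; the only thing taken from the slides is that weights_scale has one entry per output channel K, and every name below is illustrative.

```cpp
#include <cstdint>
#include <vector>

// Sketch (my own understanding, not TensorRT source) of applying the
// per-channel scales to the INT32 GEMM result inside an INT8 conv kernel.
// i32_out: raw INT32 accumulator from the INT8 GEMM, laid out as [N][K]
// input_scale: single scale used to quantize the input activations
// weights_scale: one scale per output channel K, as in the slides
void dequantizeGemmOutput(const std::vector<int32_t>& i32_out,
                          int N, int K,
                          float input_scale,
                          const std::vector<float>& weights_scale,
                          std::vector<float>& f32_out) {
    f32_out.resize(i32_out.size());
    for (int n = 0; n < N; ++n) {
        for (int k = 0; k < K; ++k) {
            // Undo both quantization scales to recover an FP32 value.
            f32_out[n * K + k] = static_cast<float>(i32_out[n * K + k]) /
                                 (input_scale * weights_scale[k]);
        }
    }
}
```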
However, the main question is that I don’t know how TensorRT quantizes the weights. I notice that the INT8 mode or data type is set when creating an engine with “tensorrt.utils.caffe_to_trt_engine”, or by calling “builder->setInt8Mode(true)” on the builder, but I don’t know when and how the weights are actually quantized inside the TensorRT framework, and I couldn’t find any references from NVIDIA.
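For reference, this is roughly how I am turning on INT8 mode with the C++ builder API. setInt8Mode is the call I mentioned above; wiring in a calibrator this way is my reading of the developer guide, so treat it as a sketch rather than a verified recipe.

```cpp
#include "NvInfer.h"

// Sketch of requesting INT8 precision on the (legacy) TensorRT builder.
// The calibrator supplies activation ranges; how the weights themselves
// get quantized during buildCudaEngine() is exactly what I am asking about.
nvinfer1::ICudaEngine* buildInt8Engine(nvinfer1::IBuilder* builder,
                                       nvinfer1::INetworkDefinition* network,
                                       nvinfer1::IInt8Calibrator* calibrator) {
    builder->setMaxBatchSize(1);
    builder->setInt8Mode(true);             // request INT8 kernels
    builder->setInt8Calibrator(calibrator); // calibration for activations
    return builder->buildCudaEngine(*network);
}
```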
Wow, thanks very much! The first slide deck on SlideShare is exactly what I need, and it really solves my problem. Page 15 of the 8-bit inference slides mentions that saturated quantization of the weights gives no accuracy improvement, but no official document or source code clearly states the quantization method for the weights. I think the slides you shared are official evidence of the quantization method for weights. Thank you very much!
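In case it helps anyone else who lands on this thread, here is a minimal sketch of the non-saturating max-abs weight quantization that page 15 implies (the slides only say saturation is not used for the weights; the per-channel layout and every name below are my own illustration, not TensorRT internals):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Minimal sketch of non-saturating (max-abs) symmetric quantization of a
// conv weight tensor, one scale per output channel. Illustrative only.
// w: FP32 weights flattened as [K][C*R*S]; q: resulting INT8 weights;
// weights_scale: per-channel scales such that q ~= w * weights_scale[k].
void quantizeWeights(const std::vector<float>& w, int K, int CRS,
                     std::vector<int8_t>& q,
                     std::vector<float>& weights_scale) {
    q.resize(w.size());
    weights_scale.resize(K);
    for (int k = 0; k < K; ++k) {
        // Map the channel's |max| to 127 (no saturation / clipping).
        float maxAbs = 0.0f;
        for (int i = 0; i < CRS; ++i)
            maxAbs = std::max(maxAbs, std::fabs(w[k * CRS + i]));
        weights_scale[k] = (maxAbs > 0.0f) ? 127.0f / maxAbs : 1.0f;
        for (int i = 0; i < CRS; ++i) {
            float v = std::round(w[k * CRS + i] * weights_scale[k]);
            q[k * CRS + i] = static_cast<int8_t>(v); // |v| <= 127 by construction
        }
    }
}
```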
Sorry, I viewed the slides online at my company, and for security reasons I was not able to download them. Maybe @han_qiu could help you.