Want to know more about INT8 precision

Hi all, I want to know the following details about what happens when we configure the --int8 option during trtexec invocation on the command line.

  1. I have the following questions about the above option:
    a. Only weight quantization?
    b. Only activation quantization?
    c. Dynamic quantization? (where quantization ranges for both weights and activations are computed dynamically during inference, as opposed to being fixed)
    d. Hybrid quantization? (where some parts of the model are weight-only quantized and other parts are activation-only quantized)
    e. Post-training quantization? (where it is a trade-off between model size, inference speed and model accuracy)
  2. There is an option to provide a calibration cache file on the trtexec command line, --calib=. I have the following questions about this:
    a. How does it work in combination with the --int8 option?
    b. How does this file need to be generated when we already have a pre-trained model?
    c. What if I don't give this option but only specify the --int8 option?
    d. Can the calibration cache generated for one model be used for inferencing other models too?
  3. There is some sample code related to INT8 precision under the directory /usr/src/tensorrt/, as well as the source file for the trtexec binary. I tried reading it with respect to all of the above questions but could not understand it properly.
  4. I also wanted clarification about which type of quantization the sample code in the /usr/src/tensorrt directory is doing, with respect to the quantization types I have listed above.
    Please clarify all the questions I have raised, pointing me to the relevant API information or the file names of the sample code. Also, please explain how the trade-off between model size, inference speed and model accuracy is handled in the sample code when we specify the --int8 option.
    I will definitely benefit from it.

Thanks and Regards

Nagaraj Trivedi

Dear @trivedi.nagaraj,
When you set only the --int8 flag, by default the dynamic range is set for all layers with dummy values.
But if the --calib option (calibration file) is used along with --int8, the calibrated data is used to fill the scales.
You can generate the calibration cache using some test data and the different calibrators (Developer Guide :: NVIDIA Deep Learning TensorRT Documentation).
As each model has a different set of layers and a different architecture, the calibration cache generated for one model cannot be used for other models.
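For illustration, a minimal calibrator sketch for generating such a cache could look like the following. This is only a sketch: it assumes a single input binding, and the class name SketchCalibrator and the loadNextBatch() helper are placeholders (a real application would feed preprocessed FP32 images from a representative data set); it is not the exact code used by trtexec.

```cpp
// Minimal INT8 entropy calibrator sketch. Assumptions (not from trtexec):
// a single input binding, dummy batch data, and the class/helper names below.
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

class SketchCalibrator : public nvinfer1::IInt8EntropyCalibrator2
{
public:
    SketchCalibrator(int batchSize, size_t inputVolume, const std::string& cacheFile)
        : mBatchSize(batchSize), mInputVolume(inputVolume), mCacheFile(cacheFile)
    {
        cudaMalloc(&mDeviceInput, mBatchSize * mInputVolume * sizeof(float));
    }
    ~SketchCalibrator() override { cudaFree(mDeviceInput); }

    int getBatchSize() const noexcept override { return mBatchSize; }

    bool getBatch(void* bindings[], const char* names[], int nbBindings) noexcept override
    {
        // Fill a host buffer with the next calibration batch and copy it to the GPU.
        std::vector<float> hostBatch(mBatchSize * mInputVolume);
        if (!loadNextBatch(hostBatch))
            return false; // no more calibration data
        cudaMemcpy(mDeviceInput, hostBatch.data(),
                   hostBatch.size() * sizeof(float), cudaMemcpyHostToDevice);
        bindings[0] = mDeviceInput; // assumes the network has a single input
        return true;
    }

    const void* readCalibrationCache(size_t& length) noexcept override
    {
        // If a cache from a previous run exists, return it so calibration is skipped.
        mCache.clear();
        std::ifstream in(mCacheFile, std::ios::binary);
        if (in)
            mCache.assign(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
        length = mCache.size();
        return mCache.empty() ? nullptr : mCache.data();
    }

    void writeCalibrationCache(const void* ptr, size_t length) noexcept override
    {
        // TensorRT hands back the computed scales here; persist them to disk.
        std::ofstream out(mCacheFile, std::ios::binary);
        out.write(static_cast<const char*>(ptr), length);
    }

private:
    // Placeholder: a real calibrator would fill 'batch' with preprocessed FP32
    // images from a representative data set; this dummy stops after 8 batches.
    bool loadNextBatch(std::vector<float>& batch)
    {
        if (mBatchCount++ >= 8)
            return false;
        std::fill(batch.begin(), batch.end(), 0.0f);
        return true;
    }

    int mBatchSize;
    size_t mInputVolume;
    std::string mCacheFile;
    void* mDeviceInput{nullptr};
    std::vector<char> mCache;
    int mBatchCount{0};
};
```

You would register such a calibrator on the builder config with config->setFlag(nvinfer1::BuilderFlag::kINT8) and config->setInt8Calibrator(&calibrator); the cache file it writes is the kind of file trtexec consumes via --calib=.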

Thank you SivaRamakrishnan for the clarification.
One point you mentioned in your reply is that when the --int8 flag is set, the dynamic range is set for all layers with dummy values. May I know where in the trtexec source code the setting of the dynamic range is handled (file name and function name) so that I can analyze it in detail? I have looked at the trtexec source in the sample code but was unable to locate it. If you can point me to it, that will help me a lot. I have also read the documentation, but a few things are not clear in it, particularly with regard to:

  1. When we configure the --int8 flag, what should be the precision of the test image (tensor)? Should it still be FP32 or FP16, or must it be converted to INT8?
  2. May I get sample code that sets the dynamic range, as you stated in your reply?

Please clarify these doubts for me.

Thanks and Regards

Nagaraj Trivedi

Hi SivaRamaKrishnan, please update me on this.

Thanks and Regards

Nagaraj Trivedi

Dear @trivedi.nagaraj,
The input data will be FP32 even though we set the precision to INT8.
You can check the setting of the dynamic range (setTensorDynamicRange) in tensorrt/samples/common/sampleEngine.cpp.
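For illustration only, setting such placeholder dynamic ranges with the C++ API looks roughly like the sketch below. The function name setDummyDynamicRanges and the range values (2.0f for network inputs, 4.0f for layer outputs) are just illustrative assumptions, not the exact trtexec implementation.

```cpp
// Sketch: assign dummy per-tensor dynamic ranges so an engine can be built
// with INT8 enabled but without a calibration cache. Illustrative only.
#include <NvInfer.h>

void setDummyDynamicRanges(nvinfer1::INetworkDefinition* network,
                           float inRange = 2.0f, float outRange = 4.0f)
{
    // Network input tensors
    for (int i = 0; i < network->getNbInputs(); ++i)
    {
        nvinfer1::ITensor* input = network->getInput(i);
        if (!input->dynamicRangeIsSet())
            input->setDynamicRange(-inRange, inRange);
    }
    // Output tensor of every layer
    for (int i = 0; i < network->getNbLayers(); ++i)
    {
        nvinfer1::ILayer* layer = network->getLayer(i);
        for (int j = 0; j < layer->getNbOutputs(); ++j)
        {
            nvinfer1::ITensor* output = layer->getOutput(j);
            if (!output->dynamicRangeIsSet())
                output->setDynamicRange(-outRange, outRange);
        }
    }
}
```

With ranges set this way and BuilderFlag::kINT8 enabled on the builder config, TensorRT quantizes internally; the input buffers you feed at inference time stay FP32, as noted above.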