Hi all, I would like to know the following details about what happens when the --int8 option is specified on the trtexec command line.
- I have the following questions about this option. Which kind of quantization does it perform:
a. Weight-only quantization?
b. Activation-only quantization?
c. Dynamic quantization? (where the quantization ranges for both weights and activations are computed dynamically during inference, rather than being fixed in advance)
d. Hybrid quantization? (where some parts of the model use weight-only quantization and other parts use activation-only quantization)
e. Post-training quantization? (where a trade-off is made between model size, inference speed, and model accuracy)
- There is an option to provide a calibration cache file on the trtexec command line (--calib=). I have the following questions about it:
a. How does it work in combination with the --int8 option?
b. How should this file be generated when we already have a pre-trained model?
c. What happens if I don't provide this option but specify only --int8?
d. Can the calibration cache generated for one model also be used when inferencing other models?
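For reference, these are the kinds of invocations I am asking about (the model path and cache file name below are just illustrative placeholders, not my actual files):

```shell
# INT8 build without an explicit calibration cache
# (question c above: what does trtexec do for calibration in this case?)
trtexec --onnx=model.onnx --int8

# INT8 build with a pre-generated calibration cache
# (questions a and b above; calib.cache is a placeholder name)
trtexec --onnx=model.onnx --int8 --calib=calib.cache
```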
- There is some sample code related to INT8 precision under the directory /usr/src/tensorrt/, as well as the source file for the trtexec binary. I tried reading them with all of the above questions in mind but could not understand them properly.
- I would also like to know which type of quantization (with respect to the categories listed above) the sample code in /usr/src/tensorrt performs.
Please provide clarifications for all the questions I have raised, along with the relevant API information or the file names of the sample code. Also, please explain how the sample code makes the trade-off between model size, inference speed, and model accuracy when --int8 is specified.
I will definitely benefit from it.
Thanks and Regards