I build an INT8 engine using the TensorRT Python API or trtexec. There are some FP32 and FP16 layers in my generated INT8 model. How can I force all layers to INT8 when building the engine? I want to check the accuracy and speed of an engine model with all layers in INT8. Thanks
Hi, please refer to the links below to perform inference in INT8
Thanks. I can already run inference on an INT8 engine. What I mean is: is there any way to force all FP32 layers (from the ONNX model) to INT8 during the build process?
Are you using implicit quantization or explicit quantization?
I use implicit quantization with batch size = 1.
You need to use the setPrecision and setOutputType APIs to force specific layers to run in INT8. TRT chooses FP32/FP16 over INT8 when those kernels are faster at small batch sizes (at batch=1, some linear ops run faster on SM cores than on Tensor Cores, and INT8 adds quantize/dequantize overhead), so by default the builder only treats INT8 as one option among several.
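A minimal sketch of that approach with the TensorRT Python API is below. Assumptions are labeled in comments: the ONNX path and calibrator are placeholders you supply, and `BuilderFlag.OBEY_PRECISION_CONSTRAINTS` is the TRT 8.x name of the strict-types flag (older releases use `BuilderFlag.STRICT_TYPES`). Without such a flag, per-layer precision settings are treated as hints only.

```python
# Sketch only, assuming a TensorRT 8.x install; "model.onnx" and the
# calibrator object are placeholders for your own files/classes.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_all_int8_engine(onnx_path, calibrator):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.INT8)
    # Make the builder obey per-layer precisions instead of treating
    # them as hints (STRICT_TYPES on pre-8.2 releases).
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
    config.int8_calibrator = calibrator

    # Force every layer to compute in INT8 and emit INT8 outputs
    # (the Python equivalents of setPrecision / setOutputType).
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        layer.precision = trt.int8
        for j in range(layer.num_outputs):
            layer.set_output_type(j, trt.int8)

    return builder.build_serialized_network(network, config)
```

Note that some layer types have no INT8 implementation; with the obey/strict flag set, the build fails for those layers rather than silently falling back, so you may need to exempt them (or use `PREFER_PRECISION_CONSTRAINTS`) if the build errors out.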