I want to learn about the manual method of adding Q/DQ layers between operations, as described in your TensorRT Developer Guide. Do you have any examples showing how to reproduce the layers in Figure 1 - Figure 10?
My TensorRT engine is slower than the default FP16 engine without Q/DQ layers because of excessive input/output scaling between operations. Your documentation recommends being conservative when adding Q/DQ operations, but the TensorRT examples and quantization source code provide no specific function to add or remove Q/DQ operations around an operation's inputs and outputs. By default, every operation quantizes its input, but I want to quantize only at the first layer, keep the network running in INT8 until the last layer, and finally dequantize the output back to float32, as in Figure 8 or Figure 9.
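To make the pattern I'm after concrete, here is a minimal pure-Python sketch (not TensorRT API calls, just the arithmetic of ONNX-style QuantizeLinear/DequantizeLinear with an assumed scale of 0.1): a single Q at the network input, intermediate ops working directly on int8 values with no per-layer Q/DQ pairs, and a single DQ at the output.

```python
def quantize(x, scale):
    """QuantizeLinear-style op: float -> int8 (round, then clamp to [-128, 127])."""
    q = round(x / scale)
    return max(-128, min(127, q))

def dequantize(q, scale):
    """DequantizeLinear-style op: int8 -> float."""
    return q * scale

scale = 0.1  # assumed calibration scale for this sketch

# One Q at the network input ...
q_in = quantize(3.14, scale)

# ... intermediate layers operate on the int8 values directly,
# so there is no extra rescaling between operations ...
q_mid = max(0, q_in)  # e.g. a ReLU computed in int8

# ... and a single DQ at the very end back to float32.
out = dequantize(q_mid, scale)
```

This is the Figure 8/9 placement I want: only two scaling operations total, instead of a Q/DQ pair wrapped around every layer.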