Hi everyone,

Not sure if this is the right place to ask this kind of question, but I can't really find an example of how int8 inference works at runtime. What I know is that, given that we are performing uniform symmetric quantisation, we calibrate the model, i.e. we find the best scale parameter for each weight tensor (channel-wise) and for the *activations* (which, if I understood correctly, correspond to the outputs of the activation functions). After the calibration process we can quantise the model by applying these scale parameters and clipping the values that end up outside the dynamic range of the given layer. So at this point we have a new neural net where all the weights are int8 in the range [-127, 127], plus some scale parameters for the *activations*.
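Just to make sure my understanding of the quantisation step itself is right, here is a small sketch of what I mean (the function name and the channel-axis convention are my own, not from any particular framework):

```python
import numpy as np

def quantise_per_channel(w, axis=0):
    """Uniform symmetric per-channel quantisation to int8 in [-127, 127]."""
    # Reduce over every axis except the channel axis to get one max-abs per channel
    reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
    max_abs = np.max(np.abs(w), axis=reduce_axes, keepdims=True)
    scale = max_abs / 127.0                      # one scale per output channel
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3, 3, 3)).astype(np.float32)  # (out_ch, in_ch, kH, kW)
q, scale = quantise_per_channel(w)
w_hat = q.astype(np.float32) * scale                       # dequantised approximation
```

So after this step each weight tensor is int8 plus a per-channel float scale, and `w_hat` only approximates the original weights to within half a quantisation step.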

What I don’t understand is how we perform inference on this new neural network. Do we feed the input as float32 or directly as int8? Are all the computations always in int8, or do we sometimes cast from int8 to float32 and vice versa?
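To show where I'm stuck, here is my current guess of what a single quantised layer might do at runtime (a linear layer for simplicity), with int32 accumulation and a requantisation back to int8. The scalar per-tensor scales, the pre-quantised int32 bias, and the ReLU fused into the requantisation step are all just my assumptions — please correct me if this is wrong:

```python
import numpy as np

def int8_linear_relu(x_q, w_q, bias_q, in_scale, w_scale, out_scale):
    """My guess at an int8 layer: int8 inputs/weights, int32 accumulator,
    then requantise the result to int8 for the next layer."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T  # accumulate in int32
    acc += bias_q            # bias assumed pre-quantised to int32 at scale in_scale * w_scale
    real = acc * (in_scale * w_scale)   # back to real-valued numbers
    real = np.maximum(real, 0.0)        # ReLU fused into the requantisation step
    return np.clip(np.round(real / out_scale), -127, 127).astype(np.int8)

x_q = np.array([[10, -20, 30]], dtype=np.int8)
w_q = np.array([[1, 2, 3], [-1, 0, 2]], dtype=np.int8)
bias_q = np.array([100, -50], dtype=np.int32)
y_q = int8_linear_relu(x_q, w_q, bias_q, in_scale=0.05, w_scale=0.1, out_scale=0.02)
```

In particular I'm unsure whether the `real = acc * (in_scale * w_scale)` step actually happens in float32 on real hardware, or whether it's folded into a fixed-point multiplier so everything stays in integer arithmetic.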

It would be nice to see a real worked example of, say, a CONV2D + bias + ReLU layer. If you could point me to some useful resources, that would be appreciated.

Thanks