Thanks - I have already seen and followed these resources to get to this point.
My issue is specifically INT8 speed relative to FP16, whereas all of these materials report speed-ups for INT8 against FP32.
My query concerns FP16 vs. INT8 for a mainstream architecture (YoloV4), on two GPU platforms where I see the same behaviour (Xavier NX and RTX 2080 Ti).
Can you advise whether what I am observing is normal and expected?
Thanks @spolisetty - my impression from all the documentation was that INT8 quantisation forces every layer to INT8, at the cost of accuracy, which depends on how well the distribution (dynamic range) of each INT8-quantised layer approximates that of the original FP32 layer.
From what you are saying, you imply that some layers will remain at FP32 (or FP16 if selected) when INT8 quantisation is a poor approximation of the original FP32, in which case the network performs inference in a mix of FP32 and INT8. Where is this discussed in the documentation? I seem to have missed it.
If this is the case:
How do I easily iterate over the final TRT network to tell which layers run in FP32, INT8, etc.?
How do I control the criteria for the decision “INT8 is a poor approximation of the original FP32 for this layer → don’t use INT8”, and hence force INT8 quantisation for additional/all layers?
I have not seen either concept in the various samples.
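To make question 1 concrete, here is the kind of inspection I was hoping for - a minimal sketch built around the TensorRT engine-inspector JSON dump (TensorRT >= 8.2). The `"Layers"`/`"Precision"` field names and the sample dump below are my assumptions about the inspector's output schema, not verified against a real engine:

```python
import json
from collections import Counter

def precision_histogram(inspector_json: str) -> Counter:
    """Tally layers by execution precision from an engine-inspector dump.

    ASSUMPTION: the "Layers" / "Precision" field names are my guess at the
    inspector's JSON schema, not verified against a real dump.
    """
    info = json.loads(inspector_json)
    return Counter(layer.get("Precision", "Unknown")
                   for layer in info.get("Layers", []))

# With a built engine, the dump would come from (TensorRT >= 8.2):
#   inspector = engine.create_engine_inspector()
#   dump = inspector.get_engine_information(trt.LayerInformationFormat.JSON)
#
# For question 2, forcing a layer to INT8 at build time would look like:
#   network.get_layer(i).precision = trt.DataType.INT8
#   config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
# (older TensorRT versions used trt.BuilderFlag.STRICT_TYPES instead).

# Hypothetical sample dump, just to exercise the helper:
sample = json.dumps({"Layers": [
    {"Name": "conv1", "Precision": "INT8"},
    {"Name": "yolo_out", "Precision": "FP32"},
]})
print(precision_histogram(sample))
```

If something like this is the intended workflow, a pointer to where the docs cover it would be much appreciated.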