Practical aspects of neural network quantization with TensorRT


I am currently exploring deep learning model quantization techniques. The official NVIDIA TensorRT documentation states that TensorRT supports quantization and applies it to both the activations and the weights of the provided model.

I am working with a repository that offers both explicit quantization (via the pytorch_quantization library) and implicit quantization (implemented through the Builder object of the Python API).

Explicit and Implicit Quantization

I would like to better understand the details described in the Explicit Versus Implicit Quantization section.

Explicit Quantization

When performing explicit quantization, I provide an ONNX graph with Q/DQ nodes to be converted to a '.trt' engine file. Can I assume that all fake-quantization nodes (as seen in Netron, for example) will be replaced by real INT8 operations in the associated layers?
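For reference, a Q/DQ ("fake quantization") node pair represents a symmetric INT8 quantize-dequantize round trip in the ONNX graph. Below is a minimal pure-Python sketch of that arithmetic; the scale value is a made-up example, not taken from any real calibration:

```python
# Sketch of the arithmetic a Q/DQ (fake-quantization) node pair represents,
# assuming symmetric INT8 quantization with a per-tensor scale.

def quantize(x: float, scale: float) -> int:
    """Q node: map a float to an INT8 code by scaling, rounding, and clamping."""
    q = round(x / scale)
    return max(-128, min(127, q))

def dequantize(q: int, scale: float) -> float:
    """DQ node: map the INT8 code back to an approximate float."""
    return q * scale

def fake_quant(x: float, scale: float) -> float:
    """Q followed by DQ: the float value the rest of the graph actually sees."""
    return dequantize(quantize(x, scale), scale)

scale = 0.1  # hypothetical calibration scale
print(fake_quant(1.23, scale))   # ~1.2 (rounded to the nearest multiple of scale)
print(fake_quant(100.0, scale))  # ~12.7 (clamped at the INT8 maximum, 127)
```

During engine building, TensorRT removes these node pairs and propagates the scales into genuinely INT8 kernels where it can, rather than executing the Q/DQ operations literally.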

Implicit Quantization

Moreover, the following statement from the documentation gives little detail about which operations are or are not quantized under implicit quantization:

“When processing implicitly quantized networks, TensorRT treats the model as a floating-point model when applying the graph optimizations, and uses INT8 opportunistically to optimize layer execution time. If a layer runs faster in INT8, then it executes in INT8. Otherwise, FP32 or FP16 is used. In this mode, TensorRT is optimizing for performance only, and you have little control over where INT8 is used - even if you explicitly set the precision of a layer at the API level, TensorRT may fuse that layer with another during graph optimization, and lose the information that it must execute in INT8. TensorRT’s PTQ capability generates an implicitly quantized network.”

Possible Strategies

What strategies can I use for profiling an inference run from a serialized network generated by TensorRT?

I have already tried Nsight Systems and trtexec with profiling options, but both only give me timing information. Do you have another profiling approach that would let me verify the operation precision of every model layer during an evaluation run?
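One approach that goes beyond timing: recent TensorRT versions can export per-layer metadata, e.g. via `trtexec --profilingVerbosity=detailed --exportLayerInfo=layers.json`, or from the Python API through the engine inspector (`engine.create_engine_inspector()`). The exported JSON includes the precision each (possibly fused) layer runs in. The sketch below parses such a file; the exact schema varies across TensorRT versions, so the field names used here (`Layers`, `Name`, `Precision`) are assumptions you may need to adapt to your output:

```python
import json

def layer_precisions(layer_info_json: str) -> dict:
    """Map each layer name to its execution precision, assuming the exported
    layer-info JSON has a top-level "Layers" list whose entries carry
    "Name" and "Precision" fields (field names differ between versions)."""
    info = json.loads(layer_info_json)
    result = {}
    for layer in info.get("Layers", []):
        # Some versions emit plain strings instead of dicts for fused layers.
        if isinstance(layer, dict):
            result[layer.get("Name", "?")] = layer.get("Precision", "unknown")
    return result

# Hypothetical excerpt of an exported layer-info file.
sample = """
{
  "Layers": [
    {"Name": "conv1", "LayerType": "Convolution", "Precision": "INT8"},
    {"Name": "fc_softmax", "LayerType": "SoftMax", "Precision": "FP16"}
  ]
}
"""
print(layer_precisions(sample))  # {'conv1': 'INT8', 'fc_softmax': 'FP16'}
```

Cross-checking this per-layer precision listing against Nsight Systems kernel names (which often encode `int8`/`fp16` in the kernel symbol) gives a second, independent confirmation.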

Could you try running your model with the trtexec command and share the `--verbose` log if the issue persists?

You can refer to the link below for the full list of supported operators. If an operator is not supported, you will need to create a custom plugin for that operation.

Also, please share your model and script, if you have not already, so that we can help you better.

Meanwhile, for some common errors and queries, please refer to the link below: