When running INT8 calibration with verbose output enabled, I can see that TensorRT does the following:
- Parse UFF file
- Run profiling of each layer
- Do int8 calibration
- Do profiling again
- Write plan file to disk.
Why is profiling done twice? This is time-consuming, and the second pass profiles the same layers as the first.
Please share your system details in the format below, along with the model file.
- Linux distro and version
- GPU type
- NVIDIA driver version
- CUDA version
- cuDNN version
- Python version (if using Python)
- TensorFlow and PyTorch version
- TensorRT version
Also, please note that the UFF parser has been deprecated from TRT 7 onwards, so we recommend using the ONNX parser.
- Ubuntu 18.04
- GTX 1050i
- Driver 455.23.05
- CUDA 10.2.109
- CUDNN 7.6.6
- TensorFlow 1.14.0
- TensorRT 220.127.116.11
I know that UFF has been deprecated, but that is irrelevant to this question. Both UFF and ONNX produce an INetworkDefinition when parsed. After that, profiling happens, which is the topic of my question.
When building an INT8 engine, the builder performs the following steps:
- Builds a 32-bit engine, runs it on the calibration set, and records, for each tensor, a histogram of the distribution of activation values.
- Builds a calibration table from the histograms.
- Builds the INT8 engine from the calibration table and the network definition.
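The first two steps above can be sketched in plain NumPy. This is only an illustration of the idea, not TensorRT's implementation: the function names `collect_histogram` and `build_calibration_table` are hypothetical, random arrays stand in for the activations an FP32 engine would produce, and the table simply stores the maximum absolute value per tensor rather than the result of any particular calibration algorithm.

```python
import numpy as np

def collect_histogram(batches, num_bins=2048):
    """Accumulate a histogram of absolute activation values over all batches.

    `batches` is a list of arrays standing in for one tensor's activations
    on the calibration set (an assumption for this sketch).
    """
    amax = max(float(np.abs(b).max()) for b in batches)
    hist = np.zeros(num_bins, dtype=np.int64)
    for b in batches:
        counts, _ = np.histogram(np.abs(b), bins=num_bins, range=(0.0, amax))
        hist += counts
    return hist, amax

def build_calibration_table(histograms):
    """Map each tensor name to a dynamic range (here: the plain maximum)."""
    return {name: amax for name, (_hist, amax) in histograms.items()}

rng = np.random.default_rng(0)
# Hypothetical tensor name; 4 calibration batches of shape (8, 16).
activations = {
    "conv1_out": [rng.standard_normal((8, 16)).astype(np.float32) for _ in range(4)]
}
histograms = {name: collect_histogram(bs) for name, bs in activations.items()}
table = build_calibration_table(histograms)
print(table)
```

Note that nothing in this sketch depends on how the activations were computed, only on their values.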
I see, that makes sense.
However, why is profiling needed for the FP32 engine? That engine doesn’t need to be the fastest one; it just has to be “an” engine that can run in FP32. It should be irrelevant which tactic is chosen: to compute the histogram we only care about the inputs and outputs of each layer, not its internal implementation, right?
To map between 32-bit floating-point values and 8-bit quantized INT8 values, TensorRT needs to know the dynamic range of each activation tensor. The dynamic range is used to determine the appropriate quantization scale. The FP32 engine provides that baseline.
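As a rough illustration of how a dynamic range turns into a quantization scale, here is a minimal sketch of symmetric INT8 quantization. It assumes the common convention scale = dynamic_range / 127 with values clipped to [-127, 127]; the helper names `quantize` and `dequantize` are made up for this example and are not TensorRT API calls.

```python
import numpy as np

def quantize(x, dynamic_range):
    """Symmetric quantization: map [-dynamic_range, dynamic_range] to [-127, 127]."""
    scale = dynamic_range / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values from the INT8 codes."""
    return q.astype(np.float32) * scale

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0], dtype=np.float32)
q, scale = quantize(x, dynamic_range=2.0)
x_hat = dequantize(q, scale)
print(q, x_hat)
```

For values inside the dynamic range, the round-trip error is at most half a quantization step, which is why a well-chosen range matters for accuracy.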
I understand that, but that’s not my question. An FP32 model needs to be built, absolutely.
My question is: why does the fastest FP32 model need to be built? Why is profiling needed to build the FP32 model? The dynamic range of the activations does not depend on how fast the layers run. It depends only on the inputs and outputs of each layer, so profiling should not be needed.
In other words, if I have e.g. a Conv2D layer, I don’t care which one of the N implementations is chosen. At the end of the day it’s still a Conv2D, very well defined mathematically, so given a set of inputs it should produce the same set of outputs, and from those we can determine the dynamic range.
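The argument can be illustrated with a small NumPy sketch that computes the same 2D convolution in two different ways (explicit loops and an im2col-style matrix product) and compares the dynamic range of the outputs. Both functions are toy stand-ins written for this example and say nothing about how TensorRT's tactics are actually implemented; the outputs are expected to agree only up to floating-point rounding.

```python
import numpy as np

def conv2d_direct(x, w):
    """Valid cross-correlation using explicit loops over output positions."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow), dtype=np.float32)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def conv2d_im2col(x, w):
    """Same operation: gather patches into rows, then one matrix-vector product."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    cols = np.stack([x[i:i + kh, j:j + kw].ravel()
                     for i in range(oh) for j in range(ow)])
    return (cols @ w.ravel()).reshape(oh, ow)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16)).astype(np.float32)
w = rng.standard_normal((3, 3)).astype(np.float32)

a, b = conv2d_direct(x, w), conv2d_im2col(x, w)
range_a, range_b = float(np.abs(a).max()), float(np.abs(b).max())
print(range_a, range_b)
```

Any difference between the two ranges comes from the order of floating-point accumulation and is far smaller than one INT8 quantization step.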
Sorry, but this is handled by an internal API/algorithm that I am not familiar with, so I may not be able to help with additional details.
If you run into any issues with INT8 precision during execution, please let me know.
Thanks! I was hoping to get some answers from the TRT developers but I guess this is as far as it goes :)
My main issue is build time: it currently takes way too long to build the plan files, so I’m looking for ways to optimize that. Caching introduces correctness problems.