Hi,
We have a model with FP32 inputs and outputs, and want to run it on DLA in INT8 mode.
If we ran it on the GPU, TensorRT would insert reformatting layers (FP32 → INT8 at the input and INT8 → FP32 at the output). However, this is not done on DLA, so the client is responsible for performing these conversions.
How is this done in practice? The FP32 → INT8 conversion requires appropriate scale factors. Where do they come from? Should they be taken from the calibration table produced during the engine build? If not, is there a "standard practice" for doing this? To make the question concrete, see the sketch below of what I imagine the client-side conversion would look like.
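Here is a rough sketch (Python/NumPy, symmetric per-tensor quantization, which I believe is what TensorRT uses for INT8). The scale value, and the idea of reading it from the calibration table, are my assumptions; please correct me if that is not the right source:

```python
import numpy as np

# Hypothetical per-tensor scale factor for the network input. My assumption
# is that this value would be taken from the calibration table / cache
# generated during the INT8 engine build.
input_scale = 0.0123

def quantize_fp32_to_int8(x_fp32: np.ndarray, scale: float) -> np.ndarray:
    """Symmetric per-tensor quantization: q = clamp(round(x / scale), -128, 127)."""
    q = np.round(x_fp32 / scale)
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_int8_to_fp32(q_int8: np.ndarray, scale: float) -> np.ndarray:
    """Dequantization back to FP32: x ≈ q * scale."""
    return q_int8.astype(np.float32) * scale

# Quantize the input before feeding it to the DLA engine; the INT8 output
# would be dequantized the same way with the output tensor's scale
# (also assumed to come from the calibration data).
x = np.random.randn(1, 3, 224, 224).astype(np.float32)
x_int8 = quantize_fp32_to_int8(x, input_scale)
```

Is this the intended approach, or is there a recommended API for obtaining the scales and doing the conversion?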
Thanks!