Hi @alex247, sure - personally, I always run models in FP16 on Jetson (with TensorRT) if an INT8 model/calibration isn't available. INT8 will give even higher performance than FP16 on Xavier/Orin, but it requires a calibration table and typically Quantization-Aware Training (QAT) for best results, whereas with FP16 you can just run any typical FP32 model through TensorRT without extra steps and still get good performance/accuracy. The TAO pre-trained models come with INT8 ready to go, though.
For FP16 inference, you should just be able to export your normal FP32 model to ONNX without needing to do anything to it. TensorRT will handle the FP16 conversion internally, including the input/output tensors.
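In case it helps, here's a rough sketch of what that looks like with the TensorRT 8.x Python API (the file paths are placeholders, not from your project). The only FP16-specific part is one builder flag; the ONNX export itself stays FP32:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def build_fp16_engine(onnx_path="model.onnx", engine_path="model_fp16.engine"):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse the ordinary FP32 ONNX export - no changes needed to the model itself
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("Failed to parse ONNX model")

    config = builder.create_builder_config()
    # This flag is what enables FP16 kernels; without it the engine stays FP32
    config.set_flag(trt.BuilderFlag.FP16)
    # (INT8 would additionally need trt.BuilderFlag.INT8 plus a calibrator here)

    serialized_engine = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(serialized_engine)

build_fp16_engine()
```

If you're using trtexec instead, the equivalent is just adding --fp16 when building the engine.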
That's up to you I suppose, but it's not required and I haven't personally done it. I believe you can use AMP (Automatic Mixed Precision) training if you want to speed up the process, but I typically just train in normal FP32 and run in FP16 at inference time.
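If you do want to try AMP, it's only a few extra lines in a standard PyTorch training loop - here's a minimal sketch using PyTorch's built-in AMP utilities (the model, optimizer, and data here are just dummy placeholders):

```python
import torch

# Placeholders - substitute your own model, data loader, and loss
model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
loader = [(torch.randn(32, 128).cuda(), torch.randint(0, 10, (32,)).cuda())
          for _ in range(10)]

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 underflow

for inputs, targets in loader:
    optimizer.zero_grad()
    # autocast runs eligible ops in FP16 while keeping master weights in FP32
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

Either way, the checkpoint you export to ONNX is still FP32, so the TensorRT side works the same.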
Just to confirm: if we are already converting an FP32 ONNX model into TensorRT, does that mean we are running in FP16 by default? Or is this something that needs to be enabled somewhere? If it is enabled by default, how can we go back to FP32 to measure the difference?