How much acceleration in practice from using FP16 versus FP32?


I am a new Xavier user who wants to deploy an instance segmentation network on it. I have learned that, thanks to the new Volta Tensor Cores, Xavier can theoretically deliver up to 11 TFLOPS at FP16, which is incredible. However, in the example by NVIDIA, the performance of mixed-precision inference is even worse than that of FP32 when the batch size is small. I suspect things may be different on Xavier, so I would like to know: have you ever compared the real performance of FP16 and FP32 on Xavier?
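For anyone wanting to reproduce this kind of comparison, here is a minimal micro-benchmark sketch (all names are illustrative, not from the thread). It times a matrix multiply at FP32 and FP16 with NumPy. Note that on hardware without native FP16 arithmetic (e.g. a desktop x86 CPU), float16 is emulated and often comes out *slower* — the same effect as the small-batch mixed-precision numbers above; hardware like Xavier's Volta Tensor Cores is what turns FP16 into a genuine speedup.

```python
# Hypothetical micro-benchmark: FP32 vs FP16 matmul throughput.
# On CPUs without native FP16, NumPy emulates float16, so FP16 may
# well be slower here -- the point is the measurement harness.
import time
import numpy as np

def time_matmul(dtype, n=256, repeats=10):
    """Return average seconds per matmul for n x n matrices of `dtype`."""
    a = np.random.rand(n, n).astype(dtype)
    b = np.random.rand(n, n).astype(dtype)
    start = time.perf_counter()
    for _ in range(repeats):
        np.matmul(a, b)
    return (time.perf_counter() - start) / repeats

t32 = time_matmul(np.float32)
t16 = time_matmul(np.float16)
print(f"FP32: {t32 * 1e3:.3f} ms/matmul")
print(f"FP16: {t16 * 1e3:.3f} ms/matmul (ratio FP32/FP16: {t32 / t16:.2f})")
```

Running the same harness idea with a framework that targets the GPU (and large enough batches) is what would reveal the Tensor Core benefit.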

Hi xiy.zhi, we will be presenting deep learning benchmarks in our upcoming webinar on Xavier; we recommend that you tune in or watch the on-demand recording afterwards.

It’s unclear if that example is using Tensor Cores in PyTorch or not. Have you tried TensorRT? It is optimized for FP16 / INT8 and Tensor Cores.
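One quick way to try this, assuming the model can be exported to ONNX (`model.onnx` below is a placeholder): the `trtexec` tool that ships with TensorRT builds an engine and reports latency/throughput, so an FP32 run and an FP16 run can be compared directly.

```shell
# Baseline FP32 engine:
trtexec --onnx=model.onnx

# FP16 engine -- lets TensorRT pick Tensor Core kernels where they win:
trtexec --onnx=model.onnx --fp16
```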

Additionally, the ARMv8.2 instruction set supported by Xavier's CPU has native FP16 support, which helps when preparing data for upload, so the per-batch cost may go down on Xavier compared with previous platforms.
I’d expect TensorRT to take advantage of this, but libraries that are not specifically optimized for Xavier probably don’t.