Serving with PyTorch and TF32/FP16

I am serving a BERT module. Profiling shows that most of its compute time is spent in the kernel volta_sgemm_128_64_tn.
If I understand correctly, sgemm means the inputs are FP32 and no Tensor Cores are being used.
What needs to be configured or changed so that serving runs in TF32/FP16 and utilizes the GPU's Tensor Cores?
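For context, here is a minimal sketch of the two common switches: enabling TF32 for matmuls (only takes effect on Ampere or newer GPUs; the "volta" in the kernel name suggests a V100, which has FP16 Tensor Cores but no TF32) and running inference under FP16 autocast. A `Linear` layer stands in for the BERT module here, and the code falls back to CPU/bfloat16 when no GPU is present — that fallback is illustrative only.

```python
import torch

# Stand-in for the served BERT module (assumption for this sketch).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(768, 768).to(device).eval()
x = torch.randn(8, 768, device=device)

# Allow TF32 in matmuls and cuDNN convolutions.
# Takes effect on Ampere-or-newer GPUs; harmless (ignored) on Volta.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# FP16 autocast routes GEMMs to half-precision Tensor Core kernels
# (e.g. volta_h884gemm_* on V100 instead of volta_sgemm_*).
dtype = torch.float16 if device == "cuda" else torch.bfloat16
with torch.inference_mode(), torch.autocast(device_type=device, dtype=dtype):
    out = model(x)
```

Alternatively, `model.half()` converts the weights themselves to FP16, which also halves memory, but autocast is usually the safer first step because it keeps numerically sensitive ops in FP32.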