FP16 inference slower than FP32 on Turing

Dear all,

I badly need your advice. I have a yolov5l object detection model (PyTorch format) which gives the following inference times on PC1:

PC1: Ubuntu20.04, RTX2080Ti: inference/image ~ 0.10s (FP16), ~ 0.12s (FP32)
PC1: Windows10, RTX2080Ti: inference/image ~ 0.13s (FP16), ~ 0.15s (FP32)
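For reference, the timing is measured roughly like this (a simplified sketch of the FP16/FP32 loop; the actual yolov5 detect.py loads the checkpoint and preprocesses real images differently, and the weights file name here is just a placeholder):

```python
import time
import torch

device = torch.device('cuda')
half = True  # set False for the FP32 run

# yolov5 checkpoints store the model under the 'model' key (sketch, not exact repo code)
model = torch.load('yolov5l.pt', map_location=device)['model'].float().eval()
if half:
    model.half()

img = torch.zeros(1, 3, 640, 640, device=device)
img = img.half() if half else img.float()

with torch.no_grad():
    for _ in range(5):           # warm-up
        model(img)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(100):         # timed runs
        model(img)
    torch.cuda.synchronize()
    print(f'inference/image: {(time.time() - t0) / 100:.3f} s')
```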

But when I try it on PC2, whose GPU has the same Turing architecture, the picture is completely the opposite:

PC2: Windows10, GTX1660Ti: inference/image ~ 2s (FP16), ~ 0.4s (FP32)

i.e., FP16 becomes about 5 times slower than FP32 (!). We get the same story on a GTX1650 (also Turing). As far as I know, this should not happen on Turing (e.g., published GTX1650 specs list FP16 (half) performance of 5.97 TFLOPS vs. 2.98 TFLOPS for FP32). Where could the problem be?
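To check whether this is a yolov5 issue or raw GPU throughput, I can also run a small standalone matmul benchmark on each card (a minimal sketch, nothing model-specific), where I would expect FP16 to be at least as fast as FP32:

```python
import time
import torch

# raw FP16 vs FP32 throughput, independent of the detection model
def bench(dtype, n=4096, iters=50):
    a = torch.randn(n, n, device='cuda', dtype=dtype)
    b = torch.randn(n, n, device='cuda', dtype=dtype)
    a @ b                        # warm-up (cuBLAS init, kernel selection)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return (time.time() - t0) / iters

for dtype in (torch.float32, torch.float16):
    print(dtype, f'{bench(dtype) * 1000:.2f} ms per 4096x4096 matmul')
```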

P.S. Inference is run with Python 3.8 and PyTorch 1.7 using CUDA 10.2.
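For completeness, this is how I check the setup on each machine (standard torch calls only):

```python
import torch

# quick environment/GPU check run on both PCs
print('torch :', torch.__version__)                    # 1.7.x
print('cuda  :', torch.version.cuda)                   # 10.2
print('cudnn :', torch.backends.cudnn.version())
print('gpu   :', torch.cuda.get_device_name(0))
print('cc    :', torch.cuda.get_device_capability(0))  # compute capability, 7.5 on Turing
```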