TensorRT inference time much faster than cuDNN


Recently I created a small network (2-3 convolution layers) with cuDNN and the same one with TensorRT, and it looks like TensorRT is 1.8-1.9 times faster than cuDNN.

So I have a question: does TensorRT perform an implicit conversion of the model to FP16 if it was provided as FP32? For example, it looks like TensorRT evaluates the model and, if the accuracy difference is not big, converts it to FP16 to improve performance …

Is that the case?
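For context, the kind of FP32-vs-FP16 accuracy comparison speculated about above can be approximated by hand with numpy. This is a hypothetical sketch for illustration only, not how TensorRT actually decides precision:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "layer": a dense matmul with FP32 weights.
x = rng.standard_normal((8, 64)).astype(np.float32)
w = rng.standard_normal((64, 32)).astype(np.float32)

y32 = x @ w  # FP32 reference output
# Same computation with inputs cast down to FP16.
y16 = (x.astype(np.float16) @ w.astype(np.float16)).astype(np.float32)

# Relative error of the FP16 computation vs the FP32 reference;
# for FP16 this is typically on the order of 1e-3.
rel_err = np.abs(y16 - y32).max() / np.abs(y32).max()
print(rel_err)
```

If such an error were acceptable for a given network, running it in FP16 would be a reasonable trade for speed.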


TensorRT Version: 8
GPU Type: GTX1080
Nvidia Driver Version: 465
CUDA Version: 11.2
CUDNN Version: 8.1

Yes, it converts to FP16 automatically.

Does it mean that if I provide an FP32 network, TensorRT could convert the whole network to FP16?

Is it possible to avoid such behaviour? For example, to enforce working with FP32 precision? Is there an option for that?

Do you mean that happens only if Tensor Cores are available?

The GTX 1080 does not have Tensor Core support, so how could TensorRT convert the network to FP16? Wouldn't it run in emulation mode and be very slow?


We do support FP16 even without tensor cores, although it will of course be much faster if tensor cores are available.
Yes, TRT could convert the whole network to FP16.
FP16 is opt-in: the default behavior is to use FP32 precision. Finer-grained control is also possible, i.e. the user can mark specific layers to run in FP32 or FP16.
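The opt-in behavior described above maps to builder-config flags in the TensorRT Python API. A minimal sketch (builder configuration only, assuming a network has already been parsed; the per-layer part uses layer index 0 purely as an illustration):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()

# Default: everything runs in FP32. To opt in to FP16 kernels:
config.set_flag(trt.BuilderFlag.FP16)

# Finer-grained control: pin an individual layer back to FP32 and
# tell the builder to honor the constraint (flag name varies by
# TensorRT 8.x minor version; older builds used STRICT_TYPES).
# layer = network.get_layer(0)
# layer.precision = trt.float32
# config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
```

So to enforce pure FP32, simply do not set the FP16 flag.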

Thank you.
