Recently I created a small network (2-3 convolution layers) with cuDNN and the same one with TensorRT, and it looks like TensorRT is 1.8-1.9 times faster than cuDNN.
So I have a question: does TensorRT perform an implicit conversion of the model to FP16 if it was provided as FP32? For example, does TensorRT evaluate the model and, if the accuracy difference is not big, convert it to FP16 and in that way improve performance?
We do support FP16 even without tensor cores, although it will of course be much faster if tensor cores are available.
Yes, TRT can convert the whole network to FP16.
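For reference, here is a minimal sketch of building an engine with FP16 enabled, assuming the TensorRT 8.x C++ API and an ONNX model; the "model.onnx" path and the Logger class are placeholders, not part of the original thread:

```cpp
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <cstdio>
#include <memory>

using namespace nvinfer1;

// Minimal logger required by the TensorRT builder.
class Logger : public ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
    }
};

int main() {
    Logger logger;
    auto builder = std::unique_ptr<IBuilder>(createInferBuilder(logger));
    auto network = std::unique_ptr<INetworkDefinition>(builder->createNetworkV2(
        1U << static_cast<int>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH)));
    auto parser = std::unique_ptr<nvonnxparser::IParser>(
        nvonnxparser::createParser(*network, logger));
    parser->parseFromFile("model.onnx",  // placeholder path
                          static_cast<int>(ILogger::Severity::kWARNING));

    auto config = std::unique_ptr<IBuilderConfig>(builder->createBuilderConfig());
    // FP32 is the default; opting in to FP16 lets TensorRT pick FP16 kernels
    // for layers where they are faster.
    if (builder->platformHasFastFp16()) {
        config->setFlag(BuilderFlag::kFP16);
    }

    auto serialized = std::unique_ptr<IHostMemory>(
        builder->buildSerializedNetwork(*network, *config));
    // ... deserialize with createInferRuntime() and run inference ...
    return 0;
}
```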
FP16 is opt-in: the default behavior is to use FP32 precision. Finer-grained control is also possible, i.e. the user can mark specific layers to run in FP32 or FP16.
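In case it helps, a sketch of that finer-grained control, again assuming the TensorRT 8.x C++ API (kOBEY_PRECISION_CONSTRAINTS is the 8.x flag; older releases used kSTRICT_TYPES). The "first layer stays FP32" rule is just an illustrative choice, and network/config would come from the builder as in the previous sketch:

```cpp
#include <NvInfer.h>

// Allow FP16 for the network overall, but pin selected layers to FP32.
void configurePrecision(nvinfer1::INetworkDefinition& network,
                        nvinfer1::IBuilderConfig& config) {
    config.setFlag(nvinfer1::BuilderFlag::kFP16);
    // Make the builder honor per-layer precision requests rather than
    // treating them as hints.
    config.setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS);

    for (int i = 0; i < network.getNbLayers(); ++i) {
        nvinfer1::ILayer* layer = network.getLayer(i);
        if (i == 0) {
            // Keep this layer's computation and output in FP32.
            layer->setPrecision(nvinfer1::DataType::kFLOAT);
            layer->setOutputType(0, nvinfer1::DataType::kFLOAT);
        } else {
            // Request FP16 for the remaining layers.
            layer->setPrecision(nvinfer1::DataType::kHALF);
        }
    }
}
```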