Estimating convolution performance with mixed precision


I’m trying to understand how to properly estimate perf for convolutions factoring in the new FP16/INT8/INT4 capabilities.

a) Is it correct that Winograd is essentially not applicable with INT8/INT4, or are there tricks implemented in TensorRT/cuDNN? (If it is not applicable, the Winograd path can be estimated as the cost of the filter transform plus the fused A · GEMM · Aᵀ stage.)

b) That seems to imply that, for 3x3 convolutions, FP16 Winograd will likely be faster than INT8 direct convolution (say, both using Turing tensor cores), but for 1x1 convolutions it is better to use INT8/INT4?
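To make the estimate in (a) concrete, here is a rough back-of-envelope multiply count for a Winograd F(m×m, r×r) layer versus direct convolution. This is only a sketch under simplifying assumptions (it counts the batched element-wise GEMMs, which dominate, and ignores the data/inverse transforms and the amortized filter transform); it is not TensorRT's or cuDNN's actual cost model, and the layer shape used is just an example.

```python
import math

def direct_conv_muls(H, W, C, K, r=3):
    # Direct r x r convolution: one multiply per
    # (output pixel, input channel, output channel, filter tap).
    return H * W * C * K * r * r

def winograd_muls(H, W, C, K, m=2, r=3):
    # Winograd F(m x m, r x r): each m x m output tile is computed from a
    # (m+r-1) x (m+r-1) transformed tile, with one multiply per transformed
    # element per (C, K) pair -- i.e. t*t batched GEMMs of size tiles x C x K.
    # Transform costs (B^T d B, G g G^T, A^T y A) are omitted here.
    t = m + r - 1                              # e.g. 4 for F(2x2, 3x3)
    tiles = math.ceil(H / m) * math.ceil(W / m)
    return tiles * t * t * C * K

# Example layer: 56x56 output, 256 -> 256 channels, 3x3 kernel.
H = W = 56
C = K = 256
d = direct_conv_muls(H, W, C, K)
w = winograd_muls(H, W, C, K)
print(f"direct: {d/1e9:.2f} G-muls, Winograd F(2,3): {w/1e9:.2f} G-muls, "
      f"reduction ~{d/w:.2f}x")
```

For this shape the ratio comes out to the familiar 2.25x multiply reduction of F(2x2, 3x3); real speedups are lower once transform overhead and memory traffic are included.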



INT8 is generally not precise enough for Winograd. We were able to get the top-1 accuracy loss below 1% on some networks, but on many the loss is larger.

And the performance advantage of INT8 Winograd compared to INT8 implicit GEMM is not great, so we decided INT8 Winograd is not the way to go.

Winograd is best for 3x3 convolutions. For 1x1, implicit GEMM or direct convolution is faster.
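The 3x3-vs-1x1 point can be seen from the ideal multiply-reduction factor of Winograd F(m×m, r×r) alone, ignoring transform overhead. This is a sketch of that textbook ratio, not a measurement of any library:

```python
def winograd_reduction(m, r):
    # Ideal multiply reduction of Winograd F(m x m, r x r) over direct
    # convolution: direct needs (m*r)^2 multiplies per m x m output tile,
    # Winograd needs (m + r - 1)^2. Transform overhead is ignored.
    return (m * r) ** 2 / (m + r - 1) ** 2

print(winograd_reduction(2, 3))  # 2.25 -> worthwhile for 3x3
print(winograd_reduction(2, 1))  # 1.0  -> no savings at all for 1x1
```

For r = 1 the factor is exactly 1, so a 1x1 convolution gains nothing from the transforms and only pays their overhead, which is why implicit GEMM or direct convolution wins there.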