I notice that cudnnConvolutionBiasActivationForward can be much slower than the corresponding cudnnConvolutionForward call for some input sizes. For example, for a depthwise convolution (group count = input channels) with input channels = output channels = 512, a 14x14 image, and stride 1, the fused call on a Jetson Nano is more than 50 times slower than the plain convolution with the same algorithm. This renders the fused op useless in this case. Has anybody else encountered something similar?
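Roughly, the comparison looks like this (a sketch of the timing setup rather than my full repro: error checking is elided, data is left uninitialized, and the batch size, 3x3 filter, ReLU activation, and IMPLICIT_PRECOMP_GEMM algorithm choice are illustrative assumptions):

```cuda
// Sketch: time cudnnConvolutionForward vs cudnnConvolutionBiasActivationForward
// on the same descriptors. Depthwise conv: groupCount = input channels = 512,
// 14x14 NCHW float input, stride 1, pad 1. Requires a GPU and cuDNN 7+.
#include <cudnn.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudnnHandle_t h;  cudnnCreate(&h);

    const int N = 1, C = 512, H = 14, W = 14, K = 512, R = 3, S = 3;

    cudnnTensorDescriptor_t xD, yD, bD;
    cudnnFilterDescriptor_t wD;
    cudnnConvolutionDescriptor_t cD;
    cudnnActivationDescriptor_t aD;

    cudnnCreateTensorDescriptor(&xD);
    cudnnSetTensor4dDescriptor(xD, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, N, C, H, W);
    cudnnCreateFilterDescriptor(&wD);
    // Depthwise: one input channel per filter group (K x C/G x R x S = 512x1x3x3).
    cudnnSetFilter4dDescriptor(wD, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, K, 1, R, S);
    cudnnCreateConvolutionDescriptor(&cD);
    cudnnSetConvolution2dDescriptor(cD, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
    cudnnSetConvolutionGroupCount(cD, C);   // group count = input channels
    cudnnCreateTensorDescriptor(&yD);
    cudnnSetTensor4dDescriptor(yD, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, N, K, H, W);
    cudnnCreateTensorDescriptor(&bD);
    cudnnSetTensor4dDescriptor(bD, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, K, 1, 1);
    cudnnCreateActivationDescriptor(&aD);
    cudnnSetActivationDescriptor(aD, CUDNN_ACTIVATION_RELU,
                                 CUDNN_NOT_PROPAGATE_NAN, 0.0);

    float *x, *w, *y, *b;
    cudaMalloc(&x, sizeof(float) * N * C * H * W);
    cudaMalloc(&w, sizeof(float) * K * R * S);
    cudaMalloc(&y, sizeof(float) * N * K * H * W);
    cudaMalloc(&b, sizeof(float) * K);

    // Same algorithm for both calls (illustrative choice).
    cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
    size_t wsSize = 0;
    cudnnGetConvolutionForwardWorkspaceSize(h, xD, wD, cD, yD, algo, &wsSize);
    void *ws = nullptr;  cudaMalloc(&ws, wsSize);

    const float one = 1.f, zero = 0.f;
    cudaEvent_t t0, t1;  cudaEventCreate(&t0);  cudaEventCreate(&t1);

    for (int pass = 0; pass < 2; ++pass) {
        cudaEventRecord(t0);
        for (int i = 0; i < 100; ++i) {
            if (pass == 0)
                cudnnConvolutionForward(h, &one, xD, x, wD, w, cD, algo,
                                        ws, wsSize, &zero, yD, y);
            else
                // z = y with alpha2 = 0, i.e. no residual input.
                cudnnConvolutionBiasActivationForward(h, &one, xD, x, wD, w, cD,
                                                      algo, ws, wsSize, &zero,
                                                      yD, y, bD, b, aD, yD, y);
        }
        cudaEventRecord(t1);  cudaEventSynchronize(t1);
        float ms;  cudaEventElapsedTime(&ms, t0, t1);
        printf("%s: %.3f ms / call\n",
               pass ? "fused conv+bias+relu" : "plain conv", ms / 100.f);
    }
    return 0;
}
```

The first iteration of each loop also acts as a warm-up; on the Jetson Nano the per-call gap between the two printed timings is what I am reporting above.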
Could you please share the repro script so we can help you better?
Also, please provide details on the platforms you are using:
o Linux distro and version
o GPU type
o Nvidia driver version
o CUDA version
o CUDNN version
o Python version [if using Python]
o TensorFlow and PyTorch versions
o TensorRT version