There is a significant performance difference between cuDNN 7.6.5 and cuDNN 8.x. The program makes sequential calls to the cuDNN convolution, batch-normalization, and activation functions. With cuDNN 7 the GPU is fully utilized, but with cuDNN 8 large time gaps appear between kernel executions (see the attached Nsight Systems timeline screenshots below).
CUDA 10.2 with cuDNN 7.6.5 (no gaps, GPU is utilized efficiently)
CUDA 10.2 with cuDNN 8.0.2 (large time gaps, inefficient GPU utilization)
The same problem exists with various CUDA 11.x and cuDNN 8.x versions.
Any ideas what could be causing the performance drop?
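For context, the per-layer call sequence looks roughly like the following. This is a minimal sketch, not the actual program: the tensor shapes, the fixed algorithm choice, the in-place updates, the layer count, and the omission of error checking are all my illustrative assumptions.

```cuda
// Sketch of the conv -> batch norm -> activation pattern (cuDNN 7/8 API).
#include <cudnn.h>
#include <cuda_runtime.h>

int main() {
    const int N = 1, C = 64, H = 56, W = 56, K = 64;  // assumed shapes
    const float alpha = 1.0f, beta = 0.0f;
    const double epsilon = 1e-5;

    cudnnHandle_t handle;
    cudnnCreate(&handle);

    // One NCHW float descriptor reused for conv input and output
    // (3x3 kernel with pad 1 and stride 1 preserves H and W).
    cudnnTensorDescriptor_t tDesc, bnDesc;
    cudnnCreateTensorDescriptor(&tDesc);
    cudnnSetTensor4dDescriptor(tDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               N, C, H, W);

    cudnnFilterDescriptor_t wDesc;
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                               K, C, 3, 3);

    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateConvolutionDescriptor(&convDesc);
    cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    cudnnCreateTensorDescriptor(&bnDesc);
    cudnnDeriveBNTensorDescriptor(bnDesc, tDesc, CUDNN_BATCHNORM_SPATIAL);

    cudnnActivationDescriptor_t actDesc;
    cudnnCreateActivationDescriptor(&actDesc);
    cudnnSetActivationDescriptor(actDesc, CUDNN_ACTIVATION_RELU,
                                 CUDNN_NOT_PROPAGATE_NAN, 0.0);

    // Fixed algorithm so both cuDNN versions run a comparable kernel.
    cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
    size_t wsBytes = 0;
    cudnnGetConvolutionForwardWorkspaceSize(handle, tDesc, wDesc, convDesc,
                                            tDesc, algo, &wsBytes);

    float *x, *y, *w, *ws, *scale, *bias, *mean, *var;
    cudaMalloc(&x, sizeof(float) * N * C * H * W);
    cudaMalloc(&y, sizeof(float) * N * K * H * W);
    cudaMalloc(&w, sizeof(float) * K * C * 3 * 3);
    cudaMalloc(&ws, wsBytes);
    cudaMalloc(&scale, sizeof(float) * K);
    cudaMalloc(&bias, sizeof(float) * K);
    cudaMalloc(&mean, sizeof(float) * K);
    cudaMalloc(&var, sizeof(float) * K);

    // Sequential per-layer calls; this is the region where the gaps
    // between kernels show up in the Nsight Systems timeline.
    for (int layer = 0; layer < 50; ++layer) {
        cudnnConvolutionForward(handle, &alpha, tDesc, x, wDesc, w, convDesc,
                                algo, ws, wsBytes, &beta, tDesc, y);
        cudnnBatchNormalizationForwardInference(handle, CUDNN_BATCHNORM_SPATIAL,
                                                &alpha, &beta, tDesc, y, tDesc, y,
                                                bnDesc, scale, bias, mean, var,
                                                epsilon);
        cudnnActivationForward(handle, actDesc, &alpha, tDesc, y,
                               &beta, tDesc, y);
    }
    cudaDeviceSynchronize();
    cudnnDestroy(handle);
    return 0;
}
```

Timing this loop under Nsight Systems on both cuDNN versions should reproduce the gap pattern described above if the regression is in the library rather than in the surrounding application code.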