cuDNN 8.x.x vs cuDNN 7.6.5 performance drop

There is a significant performance difference between cuDNN 7.6.5 and cuDNN 8.x.x. The program makes sequential calls to cuDNN convolution, batch normalization, and activation functions. The GPU is fully utilized when the program uses cuDNN 7, but huge time gaps appear between kernel executions with cuDNN 8 (see the attached Nsight Systems timeline screenshots below).

CUDA 10.2 with cuDNN 7.6.5 (no gaps, GPU is utilized efficiently)

CUDA 10.2 with cuDNN 8.0.2 (huge time gaps, inefficient GPU utilization)

The same problem exists with various CUDA 11.x and cuDNN 8.x.x versions.

Any ideas what could be causing the performance drop?

Hi @nickolay-zerkalny ,
Could you please share the logs with us?
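In case it helps, cuDNN 8.x ships with built-in API logging that can be enabled through environment variables before launching the program. A minimal sketch (the binary name is a placeholder):

```shell
# Enable cuDNN API logging (cuDNN 8.x feature):
export CUDNN_LOGINFO_DBG=1           # turn informational logging on
export CUDNN_LOGDEST_DBG=cudnn.log   # destination: stdout, stderr, or a filename
# ./your_inference_app               # placeholder: run the workload as usual
```

The resulting log records every cuDNN API call with its parameters, which is often enough to see where time is being spent on the host side.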



Here is a OneDrive link to a zipped folder with several reports from Nsight Systems 2021.1.3. Profiling was done with different combinations of GPU card, CUDA version, and cuDNN version.


Each report contains 5 consecutive inference runs of a CNN:
In report “10.2-8.0.2-rtx-2080ti” look at the timeline around 18.370
In report “10.2-7.6.5-rtx-2080ti” look at the timeline around 12.420
In report “10.0-7.6.5-rtx-2080ti” look at the timeline around 4.200
In report “11.3-8.2.0-rtx-2080ti” look at the timeline around 17.880
and so on.

The logs from Nsight Systems are in the previous message.

Hi @nickolay-zerkalny ,
Apologies for the delay. Are you still facing the issue?

Hi @AakankshaS ,
Yes, the problem still exists even in the latest cuDNN (8.2.2). I suspect that cuDNN 8 is causing those gaps; cuDNN 7.6.5 works fine. Unfortunately, it is impossible to use cuDNN 7 on Ampere GPUs.

Hi @nickolay-zerkalny ,
Thank you for confirming.
Could you please share the logs with us again? It looks like the old link has expired.
Apologies for the inconvenience.

The same link should work again: