cuDNN8: extremely slow first iteration of CNN training or inference

Comparing training times of a CNN in cuDNN7 and cuDNN8 environments, I have noticed that the first iteration in cuDNN8 is much slower.

For example, I perform a forward computation of a convolutional layer three times (same data each iteration). Here are the timings:
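The measurement follows the usual cudaEvent pattern around the forward call. Below is a minimal sketch of such a timing loop (this is illustrative, not the attached program; descriptor and memory setup are omitted, and all variable names are assumptions):

```cpp
// Sketch: timing cudnnConvolutionForward with CUDA events.
// Assumes handle, descriptors, device buffers, algo and workspace
// have already been set up as usual.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

const float alpha = 1.0f, beta = 0.0f;
for (int i = 0; i < iterations; ++i) {
    cudaEventRecord(start);
    cudnnConvolutionForward(handle, &alpha,
                            x_desc, d_x,              // input tensor
                            w_desc, d_w,              // filter
                            conv_desc, algo,
                            d_workspace, workspace_bytes,
                            &beta, y_desc, d_y);      // output tensor
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait for the kernel to finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Time %f ms\n", ms);
}
```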

cuDNN 7.6.5

Time 0.139000 ms
Time 0.133000 ms
Time 0.141000 ms

cuDNN 8.0.4

Time 100.610000 ms
Time 0.208000 ms
Time 0.183000 ms

These are the per-iteration times of a forward convolution with the same configuration on the same machine in two different Docker containers. The GPU is an NVIDIA Quadro P2000.

We can see that in the container with cuDNN8, the first iteration is about 500 times slower than the second.

Here are similar timings on a cloud machine with a K80 GPU.

cuDNN 7.6.5

Time 0.264000 ms
Time 0.227000 ms
Time 0.214000 ms

cuDNN 8.0.4

Time 122.055000 ms
Time 0.288000 ms
Time 0.246000 ms

Again, in the container with cuDNN8, the first iteration is nearly 500 times slower than the second.

I observed a similar phenomenon with many convolutional layer configurations, but not all. The effect also seems to depend on the preceding cuDNN calls: in one case, FWD and BWD computations performed in a single run showed no first-iteration slowdown, whereas the same BWD computations run without the preceding FWD computations showed a significant one.

It also seems to depend on the convolution algorithms.

I wonder what could be the reason for such a slowdown phenomenon.

I attach my test program, which runs FWD computations of a convolutional layer.
Below are the details of my experiment environments and the complete output of the program runs.

cuDNN7 environment
nvdriver: 460.91.03,
CUDA:10.2,
cuDNN:7.6.5.32-1

cuDNN8 environment
nvdriver: 460.91.03,
CUDA:11.1,
cuDNN:8.0.4.30-1

Program output on the Quadro P2000 machine
cuDNN8

$ ./fwd_convolution_libcudnn8 10 3 128 28 256 1 0 2
./fwd_convolution_libcudnn8 [N iter C H K S P U] v1.2
N - minibatch size, C - input channels, H - input height/width, K - output channels, S - filter size, P - padding, U - stride
iterations: 3
N: 10
C: 128
HW: 28
K: 256
S: 1
P: 0
U: 2
out_HW: 14

Allocating 4014.1 KB
Allocating 32.8 KB
Allocating 2007.0 KB
out_n: 10
out_c: 256
out_h: 14
out_w: 14

Convolution algorithm: 1

Allocating 133.4 KB
Time 100.610000 ms
Time 0.208000 ms
Time 0.183000 ms
Done

cuDNN7

$ ./fwd_convolution_libcudnn7 10 3 128 28 256 1 0 2
./fwd_convolution_libcudnn7 [N iter C H K S P U] v1.2
N - minibatch size, C - input channels, H - input height/width, K - output channels, S - filter size, P - padding, U - stride
iterations: 3
N: 10
C: 128
HW: 28
K: 256
S: 1
P: 0
U: 2
out_HW: 14

Allocating 4014.1 KB
Allocating 32.8 KB
Allocating 2007.0 KB
out_n: 10
out_c: 256
out_h: 14
out_w: 14

Convolution algorithm: 1

Allocating 1.2 KB
Time 0.139000 ms
Time 0.133000 ms
Time 0.141000 ms
Done

Source code:
convolution.cu (7.8 KB)

Hi @pyotr777 , please refer to this note in the documentation
https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#faq

Q: Why is cuDNN version 8.0 convolution API call much slower on the first call than subsequent calls?

A: Due to the library split, cuDNN version 8.0 API will only load the necessary kernels on the first API call that requires it. In previous versions, this load would have been observed in the first cuDNN API call that triggers CUDA context initialization, typically cudnnCreate(). In version 8.0, this is delayed until the first sub-library call that triggers CUDA context initialization. Users who desire to have CUDA context preloaded can call the new cudnnCnnInferVersionCheck() API (or its related cousins), which has the side effect of initializing a CUDA context. This will reduce the run time for all subsequent API calls.
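Applied to a test program like the one above, the preload amounts to one extra call after creating the handle. A minimal sketch (error handling omitted; the surrounding setup is assumed to be the usual cuDNN boilerplate):

```cpp
cudnnHandle_t handle;
cudnnCreate(&handle);

// cuDNN 8: force the CNN inference sub-library to load its kernels
// (and initialize the CUDA context) now, so the cost is not paid
// inside the first cudnnConvolutionForward call.
cudnnCnnInferVersionCheck();

// ... descriptor setup and timed convolution calls as before ...
```

For training workloads, the related per-sub-library calls (e.g. cudnnCnnTrainVersionCheck()) serve the same purpose for their respective kernels.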

Let us know if this resolves your issue!
Also, we would recommend upgrading to the latest release (v8.3.1 as of today). Many fixes and improvements have gone in since 8.0.4.

Thank you!
It worked. I added a call to cudnnCnnInferVersionCheck() to my sample code, and the times on Quadro P2000 in the cuDNN8 environment changed to:

Time 0.231000 ms
Time 0.157000 ms
Time 0.191000 ms
