Comparing training times of a CNN in cuDNN7 and cuDNN8 environments, I have noticed that the first iteration in cuDNN8 is much slower.
For example, I perform a forward computation of a convolutional layer three times (same data each iteration). Here are the timings:
cuDNN 7.6.5
Time 0.139000 ms
Time 0.133000 ms
Time 0.141000 ms
cuDNN 8.0.4
Time 100.610000 ms
Time 0.208000 ms
Time 0.183000 ms
These are the times of each iteration of a forward convolution with the same configuration on the same machine in two different Docker containers. The GPU is an NVIDIA Quadro P2000.
We can see that in the container with cuDNN8, the first iteration is about 500 times slower than the second.
I see similar timings on a cloud machine with a K80 GPU:
cuDNN 7.6.5
Time 0.264000 ms
Time 0.227000 ms
Time 0.214000 ms
cuDNN 8.0.4
Time 122.055000 ms
Time 0.288000 ms
Time 0.246000 ms
Again, in the container with cuDNN8, the first iteration is more than 400 times slower than the second.
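For reference, the slowdown factors can be recomputed directly from the timings listed above. This is only a small host-side sketch; the function and variable names are mine and are not taken from the attached program.

```cpp
#include <cassert>
#include <vector>

// Ratio of the first iteration's time to the second iteration's time.
// Times are in milliseconds; at least two samples are required.
double first_to_second_ratio(const std::vector<double>& times_ms) {
    assert(times_ms.size() >= 2 && times_ms[1] > 0.0);
    return times_ms[0] / times_ms[1];
}

// Timings copied from the cuDNN 8.0.4 listings above.
const std::vector<double> p2000_cudnn8_ms = {100.610, 0.208, 0.183};
const std::vector<double> k80_cudnn8_ms   = {122.055, 0.288, 0.246};
```

Calling first_to_second_ratio on these two vectors gives roughly 484 for the P2000 and roughly 424 for the K80.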
I observed a similar phenomenon with many convolutional layer configurations, but not all. The effect also seems to depend on the preceding cuDNN calls: in one case, FWD and BWD calculations performed in one go showed no slowdown in the first iteration, whereas the same BWD calculations without the FWD calculations showed a significant slowdown in the first iteration.
It also seems to depend on the convolution algorithms.
What could be the reason for this slowdown?
I attach my test program, which runs FWD computations of a convolutional layer.
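The attached file is not reproduced here. As a rough illustration of the kind of per-iteration measurement involved, the following host-only sketch times a callable several times and separates the first call from the rest. It is not the code from convolution.cu: the names time_iterations and steady_state_mean_ms are hypothetical, and it assumes the callable blocks until the work is finished (for GPU work that would mean synchronizing inside it, for example with cudaDeviceSynchronize()).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Runs `work` `iterations` times and returns the wall-clock time of each call in ms.
// Assumption: `work` returns only after the computation has completed.
std::vector<double> time_iterations(const std::function<void()>& work, int iterations) {
    assert(iterations > 0);
    std::vector<double> times_ms;
    times_ms.reserve(static_cast<std::size_t>(iterations));
    for (int i = 0; i < iterations; ++i) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        times_ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return times_ms;
}

// Mean of all iterations except the first, so that one-time start-up cost
// in the first call does not distort the steady-state figure.
double steady_state_mean_ms(const std::vector<double>& times_ms) {
    assert(times_ms.size() >= 2);
    double sum = 0.0;
    for (std::size_t i = 1; i < times_ms.size(); ++i) {
        sum += times_ms[i];
    }
    return sum / static_cast<double>(times_ms.size() - 1);
}
```

Reporting the first iteration and the steady-state mean separately makes it easier to compare the two cuDNN versions without the one-time cost dominating the result.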
Below are the details of my experiment environments and the complete program output.
cuDNN7 environment
NVIDIA driver: 460.91.03
CUDA: 10.2
cuDNN: 7.6.5.32-1
cuDNN8 environment
NVIDIA driver: 460.91.03
CUDA: 11.1
cuDNN: 8.0.4.30-1
Program output on the Quadro P2000 machine
cuDNN8
$ ./fwd_convolution_libcudnn8 10 3 128 28 256 1 0 2
./fwd_convolution_libcudnn8 [N iter C H K S P U] v1.2
N - minibatch size, C - input channels, H - input height/width, K - output channels, S - filter size, P - padding, U - stride
iterations: 3
N: 10
C: 128
HW: 28
K: 256
S: 1
P: 0
U: 2
out_HW: 14
Allocating 4014.1 KB
Allocating 32.8 KB
Allocating 2007.0 KB
out_n: 10
out_c: 256
out_h: 14
out_w: 14
Convolution algorithm: 1
Allocating 133.4 KB
Time 100.610000 ms
Time 0.208000 ms
Time 0.183000 ms
Done
cuDNN7
$ ./fwd_convolution_libcudnn7 10 3 128 28 256 1 0 2
./fwd_convolution_libcudnn7 [N iter C H K S P U] v1.2
N - minibatch size, C - input channels, H - input height/width, K - output channels, S - filter size, P - padding, U - stride
iterations: 3
N: 10
C: 128
HW: 28
K: 256
S: 1
P: 0
U: 2
out_HW: 14
Allocating 4014.1 KB
Allocating 32.8 KB
Allocating 2007.0 KB
out_n: 10
out_c: 256
out_h: 14
out_w: 14
Convolution algorithm: 1
Allocating 1.2 KB
Time 0.139000 ms
Time 0.133000 ms
Time 0.141000 ms
Done
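As a sanity check on the configuration, the out_HW value printed in both runs is consistent with the usual output-size formula for a convolution without dilation, floor((H + 2*P - S) / U) + 1. The sketch below uses the parameter names from the usage line; the function name conv_out_dim is mine, and I am assuming dilation 1 since the program takes no dilation argument.

```cpp
#include <cassert>

// Output height/width of a convolution with dilation 1:
//   floor((in_dim + 2*pad - filter) / stride) + 1
// Integer division equals floor here because the numerator is non-negative.
int conv_out_dim(int in_dim, int filter, int pad, int stride) {
    assert(in_dim > 0 && filter > 0 && pad >= 0 && stride > 0);
    assert(in_dim + 2 * pad >= filter);
    return (in_dim + 2 * pad - filter) / stride + 1;
}
```

With H = 28, S = 1, P = 0 and U = 2, conv_out_dim(28, 1, 0, 2) gives 14, which matches out_h and out_w in the output above.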
Source code:
convolution.cu (7.8 KB)