cuDNN8: extremely slow first iteration of CNN training or inference

Comparing training times of a CNN in cuDNN7 and cuDNN8 environments, I have noticed that the first iteration in cuDNN8 is much slower.

For example, I perform a forward computation of a convolutional layer three times (same data each iteration). Here are the timings:
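The measurement follows the usual cudaEvent pattern around the forward call. Below is a minimal sketch of such a timing loop (this is illustrative, not the attached program; descriptor and memory setup are omitted, and all variable names are assumptions):

```cpp
// Sketch: timing cudnnConvolutionForward with CUDA events.
// Assumes handle, descriptors, device buffers, algo and workspace
// have already been set up as usual.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

const float alpha = 1.0f, beta = 0.0f;
for (int i = 0; i < iterations; ++i) {
    cudaEventRecord(start);
    cudnnConvolutionForward(handle, &alpha,
                            x_desc, d_x,              // input tensor
                            w_desc, d_w,              // filter
                            conv_desc, algo,
                            d_workspace, workspace_bytes,
                            &beta, y_desc, d_y);      // output tensor
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait for the kernel to finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Time %f ms\n", ms);
}
```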

cuDNN 7.6.5

Time 0.139000 ms
Time 0.133000 ms
Time 0.141000 ms

cuDNN 8.0.4

Time 100.610000 ms
Time 0.208000 ms
Time 0.183000 ms

These are the per-iteration times of a forward convolution with the same configuration on the same machine in two different Docker containers. The GPU is an NVIDIA Quadro P2000.

We can see that in the container with cuDNN8, the first iteration is about 500 times slower than the second.

Here are similar timings on a cloud machine with a K80 GPU.

cuDNN 7.6.5

Time 0.264000 ms
Time 0.227000 ms
Time 0.214000 ms

cuDNN 8.0.4

Time 122.055000 ms
Time 0.288000 ms
Time 0.246000 ms

Again, in the container with cuDNN8, the first iteration is nearly 500 times slower than the second.

I observed a similar phenomenon with many convolutional layer configurations, but not all. The effect also seems to depend on the preceding cuDNN calls: in one case, FWD and BWD computations performed in a single run showed no first-iteration slowdown, whereas the same BWD computations run without the preceding FWD computations showed a significant one.

It also seems to depend on the convolution algorithms.

I wonder what could be the reason for such a slowdown phenomenon.

I attach my test program, which runs FWD computations of a convolutional layer.
Below are the details of my experiment environments and the complete output of the program runs.

cuDNN7 environment
nvdriver: 460.91.03,
CUDA:10.2,
cuDNN:7.6.5.32-1

cuDNN8 environment
nvdriver: 460.91.03,
CUDA:11.1,
cuDNN:8.0.4.30-1

Program output on the Quadro P2000 machine
cuDNN8

$ ./fwd_convolution_libcudnn8 10 3 128 28 256 1 0 2
./fwd_convolution_libcudnn8 [N iter C H K S P U] v1.2
N - minibatch size, C - input channels, H - input height/width, K - output channels, S - filter size, P - padding, U - stride
iterations: 3
N: 10
C: 128
HW: 28
K: 256
S: 1
P: 0
U: 2
out_HW: 14

Allocating 4014.1 KB
Allocating 32.8 KB
Allocating 2007.0 KB
out_n: 10
out_c: 256
out_h: 14
out_w: 14

Convolution algorithm: 1

Allocating 133.4 KB
Time 100.610000 ms
Time 0.208000 ms
Time 0.183000 ms
Done

cuDNN7

$ ./fwd_convolution_libcudnn7 10 3 128 28 256 1 0 2
./fwd_convolution_libcudnn7 [N iter C H K S P U] v1.2
N - minibatch size, C - input channels, H - input height/width, K - output channels, S - filter size, P - padding, U - stride
iterations: 3
N: 10
C: 128
HW: 28
K: 256
S: 1
P: 0
U: 2
out_HW: 14

Allocating 4014.1 KB
Allocating 32.8 KB
Allocating 2007.0 KB
out_n: 10
out_c: 256
out_h: 14
out_w: 14

Convolution algorithm: 1

Allocating 1.2 KB
Time 0.139000 ms
Time 0.133000 ms
Time 0.141000 ms
Done

Source code:
convolution.cu (7.8 KB)

Hi @pyotr777 , please refer to this note in the documentation
https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#faq

Q: Why is cuDNN version 8.0 convolution API call much slower on the first call than subsequent calls?

A: Due to the library split, cuDNN version 8.0 API will only load the necessary kernels on the first API call that requires it. In previous versions, this load would have been observed in the first cuDNN API call that triggers CUDA context initialization, typically cudnnCreate(). In version 8.0, this is delayed until the first sub-library call that triggers CUDA context initialization. Users who desire to have CUDA context preloaded can call the new cudnnCnnInferVersionCheck() API (or its related cousins), which has the side effect of initializing a CUDA context. This will reduce the run time for all subsequent API calls.
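Applied to a test program like the one above, the preload amounts to one extra call after creating the handle. A minimal sketch (error handling omitted; the surrounding setup is assumed to be the usual cuDNN boilerplate):

```cpp
cudnnHandle_t handle;
cudnnCreate(&handle);

// cuDNN 8: force the CNN inference sub-library to load its kernels
// (and initialize the CUDA context) now, so the cost is not paid
// inside the first cudnnConvolutionForward call.
cudnnCnnInferVersionCheck();

// ... descriptor setup and timed convolution calls as before ...
```

For training workloads, the related per-sub-library calls (e.g. cudnnCnnTrainVersionCheck()) serve the same purpose for their respective kernels.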

Let us know if this resolves your issue!
Also, we would recommend upgrading to the latest release (v8.3.1 as of today). Many fixes and improvements have gone in since 8.0.4.

Thank you!
It worked. I added a call to cudnnCnnInferVersionCheck() to my sample code, and the times on Quadro P2000 in the cuDNN8 environment changed to:

Time 0.231000 ms
Time 0.157000 ms
Time 0.191000 ms
