I am observing some slowdown on the first iteration of CNN training.

After the GPU has warmed up to 100% of its SM frequency, I call cuDNN functions to perform the convolutional layer computations.

I perform the same operation several times consecutively, and in most cases (but not always) the first iteration is up to several times slower than the subsequent ones.
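To make the measurement procedure concrete, here is a minimal host-side sketch of such a timing loop. It is not my actual benchmark code: `time_iterations` and `op` are hypothetical names, and the sketch assumes that `op` blocks until the GPU has finished (for example by ending with `cudaDeviceSynchronize()`), because otherwise only the launch overhead would be timed.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Runs `op` `iterations` times and returns the wall-clock time of every run in
// milliseconds. `op` is a hypothetical callable; for GPU work it is assumed to
// block until the device is idle before returning.
std::vector<double> time_iterations(const std::function<void()>& op, int iterations) {
    assert(iterations > 0);
    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(iterations));
    for (int i = 0; i < iterations; ++i) {
        const auto start = std::chrono::steady_clock::now();
        op();  // the operation under test, e.g. one forward convolution
        const auto stop = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return ms;
}
```

Keeping every sample, instead of only an average, is what makes the slow first iteration visible.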

For example, here are the time logs of forward convolution operations:

```
Operation ConvFwd_0: 0.088867ms
Operation ConvFwd_0: 0.052979ms
Operation ConvFwd_0: 0.050049ms
Operation ConvFwd_0: 0.048096ms
Operation ConvFwd_0: 0.051025ms
```

The first iteration takes roughly 175% of the time of the subsequent ones.

Another example for a different convolutional layer configuration and backward computations:

```
Operation ConvBwdFilter_0: 24.500000ms
Operation ConvBwdData_0: 0.635010ms
Operation ConvBwdFilter_0: 0.897949ms
Operation ConvBwdData_0: 0.600098ms
Operation ConvBwdFilter_0: 0.883057ms
Operation ConvBwdData_0: 0.588867ms
Operation ConvBwdFilter_0: 0.875977ms
Operation ConvBwdData_0: 0.595947ms
Operation ConvBwdFilter_0: 0.880859ms
Operation ConvBwdData_0: 0.592041ms
```

The first iteration of the backward filter calculation is about 27 times slower than the subsequent identical calculations. This is a rare, extreme case, though.
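The slowdown factors can be recomputed from the logs with a small helper like the one below. This is only a sketch: `parse_log` and `first_to_rest_ratio` are hypothetical names, and it assumes every timing line has exactly the `Operation <name>: <time>ms` shape shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Groups lines of the form "Operation <name>: <time>ms" by operation name,
// keeping the timings in the order in which they were logged.
std::map<std::string, std::vector<double>> parse_log(const std::string& log) {
    std::map<std::string, std::vector<double>> timings;
    std::istringstream lines(log);
    std::string line;
    const std::string prefix = "Operation ";
    while (std::getline(lines, line)) {
        const std::size_t colon = line.find(": ");
        if (line.compare(0, prefix.size(), prefix) != 0 || colon == std::string::npos) {
            continue;  // not a timing line
        }
        const std::string name = line.substr(prefix.size(), colon - prefix.size());
        // std::stod stops at the first non-numeric character, i.e. at "ms".
        timings[name].push_back(std::stod(line.substr(colon + 2)));
    }
    return timings;
}

// Ratio of the first timing to the mean of all later timings.
double first_to_rest_ratio(const std::vector<double>& ms) {
    assert(ms.size() >= 2);
    double rest = 0.0;
    for (std::size_t i = 1; i < ms.size(); ++i) {
        rest += ms[i];
    }
    return ms[0] / (rest / static_cast<double>(ms.size() - 1));
}
```

Applied to the logs above, this gives a ratio of about 1.76 for `ConvFwd_0`, about 27.7 for `ConvBwdFilter_0`, and about 1.07 for `ConvBwdData_0`.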

In my previous question here: cuDNN8: extreamly slow first iteration of CNN training or inference

I also reported a huge slowdown on the first iteration of CNN training, but there the slowdown was much larger. It was caused by cuDNN8 not loading kernels beforehand. I fixed it with a call to cudnnCnnInferVersionCheck().

Now the problem is not tied to a particular cuDNN version: I observe similar results with both cuDNN7 and cuDNN8, although the slowdown seems more pronounced with cuDNN8.