Impact of the First Launched Kernel on Subsequent Ones

Hi,

I have been doing performance benchmarking on an RTX 2070 GPU. One thing I have noticed is that the first launched kernel can somehow negatively affect the performance of subsequent ones. As an example:

// Benchmark 1
for (int i = 0; i < 1000; ++i)
  kernel_A<<<...>>>();
cudaDeviceSynchronize();
for (int i = 0; i < 1000; ++i)
  kernel_B<<<...>>>();

// Benchmark 2
for (int i = 0; i < 1000; ++i)
  kernel_B<<<...>>>();
cudaDeviceSynchronize();
for (int i = 0; i < 1000; ++i)
  kernel_B<<<...>>>();

// Benchmark 3
for (int i = 0; i < 1000; ++i)
  kernel_C<<<...>>>();
cudaDeviceSynchronize();
for (int i = 0; i < 1000; ++i)
  kernel_B<<<...>>>();

// Benchmark 4
for (int i = 0; i < 1000; ++i)
  kernel_B<<<...>>>();
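
For reference, a self-contained version of Benchmark 1 looks roughly like the sketch below, with CUDA events used to time each loop. The empty kernel bodies and the <<<64, 256>>> launch configurations are placeholders, not my real ones:

// Sketch of Benchmark 1 with CUDA event timing (placeholder kernels and configs).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel_A() { /* placeholder body */ }
__global__ void kernel_B() { /* placeholder body */ }

// Time 'iters' back-to-back launches issued by the given launcher.
static float time_loop(void (*launch)(), int iters) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i)
    launch();
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

static void launch_A() { kernel_A<<<64, 256>>>(); }  // placeholder config
static void launch_B() { kernel_B<<<64, 256>>>(); }  // placeholder config

int main() {
  float msA = time_loop(launch_A, 1000);  // first loop: kernel_A
  cudaDeviceSynchronize();
  float msB = time_loop(launch_B, 1000);  // second loop: kernel_B
  printf("kernel_A loop: %.3f ms, kernel_B loop: %.3f ms\n", msA, msB);
  return 0;
}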

The performance numbers reported by nvprof are roughly the same for Benchmarks 2, 3, and 4, whereas the numbers for Benchmark 1 are roughly 10% worse than the others. I suspect the reason is that kernel_A has low SM occupancy during its execution (in its final execution wave, it is only able to dispatch one thread block per SM), but I am not sure how this can affect the performance of kernel_B, especially given that there is a cudaDeviceSynchronize call in between.
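
To double-check the occupancy side, I can query how many thread blocks of kernel_A can be resident per SM. A minimal sketch (the 256-thread block size and the empty kernel body are placeholders for my real ones):

// Sketch: query kernel_A's theoretical per-SM occupancy.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel_A() { /* placeholder body */ }

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  int blocksPerSM = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &blocksPerSM, kernel_A, /*blockSize=*/256, /*dynamicSMemSize=*/0);

  printf("kernel_A: up to %d resident block(s) per SM, %d SMs on this GPU\n",
         blocksPerSM, prop.multiProcessorCount);
  return 0;
}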

Could anyone please give me some hints on how to mitigate this problem? Thanks.

Sorry but it seems that I put it into the wrong category. This is not directly related to the libraries. I will re-post on the CUDA performance discussion board.

Sorry again for the confusion.