Impact of the First Launched Kernel on Subsequent Ones

Hi,

I have been doing performance benchmarking on an RTX 2070 GPU. One thing I have noticed is that the first kernels launched can somehow negatively affect the performance of subsequent ones. As an example:

// Benchmark 1
for (int i = 0; i < 1000; ++i)
  kernel_A<<<...>>>();
cudaDeviceSynchronize();
for (int i = 0; i < 1000; ++i)
  kernel_B<<<...>>>();

// Benchmark 2
for (int i = 0; i < 1000; ++i)
  kernel_B<<<...>>>();
cudaDeviceSynchronize();
for (int i = 0; i < 1000; ++i)
  kernel_B<<<...>>>();

// Benchmark 3
for (int i = 0; i < 1000; ++i)
  kernel_C<<<...>>>();
cudaDeviceSynchronize();
for (int i = 0; i < 1000; ++i)
  kernel_B<<<...>>>();

// Benchmark 4
for (int i = 0; i < 1000; ++i)
  kernel_B<<<...>>>();
The performance numbers reported by nvprof are roughly the same for Benchmarks 2, 3, and 4, whereas the numbers for Benchmark 1 are roughly 10% worse. I suspect the reason is that kernel_A has low SM occupancy (in its final execution wave, it can only dispatch one thread block per SM), but I am not sure how this can affect the performance of kernel_B, especially given that there is a cudaDeviceSynchronize call in between.
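For reference, the occupancy figure above comes from a runtime query along these lines (a minimal sketch: kernel_A's real body and block size are elided here, so the empty kernel and the block size of 256 are placeholders):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder stand-in for the real kernel_A.
__global__ void kernel_A() {}

int main() {
  const int blockSize = 256;  // placeholder: kernel_A's actual block size
  int maxBlocksPerSM = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &maxBlocksPerSM, kernel_A, blockSize, 0 /* dynamic smem bytes */);

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  double occupancy = (double)(maxBlocksPerSM * blockSize) /
                     prop.maxThreadsPerMultiProcessor;
  printf("kernel_A: %d block(s)/SM, theoretical occupancy %.0f%%\n",
         maxBlocksPerSM, occupancy * 100.0);
  return 0;
}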

Could anyone please give me some hints on how to mitigate this problem? Thanks.

It is hard to diagnose such an issue from a very generic description. Since you are asking for hints, consider (1) warm-up effects and (2) dynamic clocking of the GPU. The first can be addressed with proper benchmarking methodology (e.g., do ten runs and report the time of the fastest); the second can be addressed by fixing the GPU clocks via nvidia-smi, if your hardware supports it (in my experience, that is only the case for professional GPUs, not consumer models). After applying these measures, any deviation below 2% should be considered measurement noise.
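As a concrete illustration of the first point, a best-of-ten harness with CUDA events might look roughly like this (a sketch: the empty kernel_B and its launch configuration are placeholders, not your actual kernel). For the second point, recent drivers expose nvidia-smi --lock-gpu-clocks on supported hardware.

#include <cfloat>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel_B() {}  // placeholder for the real kernel_B

int main() {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  kernel_B<<<1024, 256>>>();  // untimed warm-up launch (assumed config)
  cudaDeviceSynchronize();

  float best = FLT_MAX;
  for (int run = 0; run < 10; ++run) {  // ten runs, keep the fastest
    cudaEventRecord(start);
    for (int i = 0; i < 1000; ++i)
      kernel_B<<<1024, 256>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms < best) best = ms;
  }
  printf("fastest of 10 runs: %.3f ms per 1000 launches\n", best);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 0;
}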

@njuffa Thank you so much for your reply.

GPU trace without cuBLAS.

GPU trace with a single cuBLAS call.

After diving into the nvprof traces, I noticed that the cuBLAS call seems to be the cause of the roughly 7% performance drop. I repeated the above experiments several times and it happens every time, so IMHO it is not caused by a warm-up effect.
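The experiment was structured roughly as follows (a sketch: the post does not pin down which cuBLAS routine or problem sizes were involved, so the SGEMM and the dimensions below are only assumed stand-ins, and kernel_B is again a placeholder; build with -lcublas):

#include <cublas_v2.h>
#include <cuda_runtime.h>

__global__ void kernel_B() {}  // placeholder for the real kernel_B

int main() {
  const int n = 1024;  // assumed problem size
  float *A, *B, *C;
  cudaMalloc(&A, n * n * sizeof(float));
  cudaMalloc(&B, n * n * sizeof(float));
  cudaMalloc(&C, n * n * sizeof(float));  // contents irrelevant for timing

  cublasHandle_t handle;
  cublasCreate(&handle);
  const float alpha = 1.0f, beta = 0.0f;
  // The single cuBLAS call before the timed loop; an SGEMM is assumed here.
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, A, n, B, n, &beta, C, n);
  cudaDeviceSynchronize();

  for (int i = 0; i < 1000; ++i)  // timed region: the kernel_B loop
    kernel_B<<<1024, 256>>>();
  cudaDeviceSynchronize();

  cublasDestroy(handle);
  cudaFree(A); cudaFree(B); cudaFree(C);
  return 0;
}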

It may be a caching effect. I don’t know how you would “mitigate” that, based on what has been provided so far in this posting.
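One way to test the caching hypothesis (a sketch under the assumption that streaming writes through a sufficiently large scratch buffer will evict the previous kernel's lines from L2; the helper name and sizing factor are made up here):

#include <cuda_runtime.h>

// Hypothetical helper: overwrite a scratch buffer several times larger
// than the L2 so that data left behind by a previous kernel is evicted.
void flushL2() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  size_t bytes = 4 * (size_t)prop.l2CacheSize;  // comfortably exceeds L2
  void* scratch = nullptr;
  cudaMalloc(&scratch, bytes);
  cudaMemset(scratch, 0, bytes);  // device-side memset streams through L2
  cudaDeviceSynchronize();
  cudaFree(scratch);
}

If calling such a flush between the first loop and the timed loop makes the numbers for all four benchmarks converge, L2 residency is the likely explanation.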

If you are able to profile with Nsight Compute, clocks can be locked even on consumer models by setting "Clock Control" to "Base".