Impact of the First Launched Kernel on Subsequent Ones

Hi,

I have been doing performance benchmarking on an RTX 2070 GPU. One thing I have noticed is that the first kernels launched can somehow negatively affect the performance of subsequent ones. As an example:

// Benchmark 1
for (int i = 0; i < 1000; ++i)
  kernel_A<<<...>>>();
cudaDeviceSynchronize();
for (int i = 0; i < 1000; ++i)
  kernel_B<<<...>>>();

// Benchmark 2
for (int i = 0; i < 1000; ++i)
  kernel_B<<<...>>>();
cudaDeviceSynchronize();
for (int i = 0; i < 1000; ++i)
  kernel_B<<<...>>>();

// Benchmark 3
for (int i = 0; i < 1000; ++i)
  kernel_C<<<...>>>();
cudaDeviceSynchronize();
for (int i = 0; i < 1000; ++i)
  kernel_B<<<...>>>();

// Benchmark 4
for (int i = 0; i < 1000; ++i)
  kernel_B<<<...>>>();
The performance numbers reported by nvprof are roughly the same for Benchmarks 2, 3, and 4, whereas the numbers for Benchmark 1 are roughly 10% worse. I suspect the reason is that kernel_A has low SM occupancy (in its final execution wave, it can only dispatch one thread block per SM), but I am not sure how this can affect the performance of kernel_B, especially given that there is a cudaDeviceSynchronize call in between.
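For reference, the occupancy figure above comes from a runtime query along these lines (a minimal sketch: kernel_A's real body and block size are elided here, so the empty kernel and the block size of 256 are placeholders):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder stand-in for the real kernel_A.
__global__ void kernel_A() {}

int main() {
  const int blockSize = 256;  // placeholder: kernel_A's actual block size
  int maxBlocksPerSM = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &maxBlocksPerSM, kernel_A, blockSize, 0 /* dynamic smem bytes */);

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  double occupancy = (double)(maxBlocksPerSM * blockSize) /
                     prop.maxThreadsPerMultiProcessor;
  printf("kernel_A: %d block(s)/SM, theoretical occupancy %.0f%%\n",
         maxBlocksPerSM, occupancy * 100.0);
  return 0;
}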

Could anyone please give me some hints on how to mitigate this problem? Thanks.

It is hard to diagnose such an issue from a very generic description. Since you are asking for hints, consider (1) warm-up effects and (2) dynamic clocking of the GPU. The first can be addressed with proper benchmarking methodology (e.g., do ten runs and report the time of the fastest); the second can be addressed by fixing the GPU clocks via nvidia-smi, if your hardware supports it (in my experience, that is only the case for professional GPUs, not consumer models). After applying these measures, any deviation below 2% should be considered measurement noise.
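As a concrete illustration of the first point, a best-of-ten harness with CUDA events might look roughly like this (a sketch: the empty kernel_B and its launch configuration are placeholders, not your actual kernel). For the second point, recent drivers expose nvidia-smi --lock-gpu-clocks on supported hardware.

#include <cfloat>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel_B() {}  // placeholder for the real kernel_B

int main() {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  kernel_B<<<1024, 256>>>();  // untimed warm-up launch (assumed config)
  cudaDeviceSynchronize();

  float best = FLT_MAX;
  for (int run = 0; run < 10; ++run) {  // ten runs, keep the fastest
    cudaEventRecord(start);
    for (int i = 0; i < 1000; ++i)
      kernel_B<<<1024, 256>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms < best) best = ms;
  }
  printf("fastest of 10 runs: %.3f ms per 1000 launches\n", best);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 0;
}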

@njuffa Thank you so much for your reply.

GPU trace without cuBLAS.

GPU trace with a single cuBLAS call.

After diving into the nvprof traces, I noticed that the cuBLAS call seems to be the cause of the roughly 7% performance drop. I repeated the above experiments several times and it happens every time, so IMHO it is not caused by a warm-up effect.
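The experiment was structured roughly as follows (a sketch: the post does not pin down which cuBLAS routine or problem sizes were involved, so the SGEMM and the dimensions below are only assumed stand-ins, and kernel_B is again a placeholder; build with -lcublas):

#include <cublas_v2.h>
#include <cuda_runtime.h>

__global__ void kernel_B() {}  // placeholder for the real kernel_B

int main() {
  const int n = 1024;  // assumed problem size
  float *A, *B, *C;
  cudaMalloc(&A, n * n * sizeof(float));
  cudaMalloc(&B, n * n * sizeof(float));
  cudaMalloc(&C, n * n * sizeof(float));  // contents irrelevant for timing

  cublasHandle_t handle;
  cublasCreate(&handle);
  const float alpha = 1.0f, beta = 0.0f;
  // The single cuBLAS call before the timed loop; an SGEMM is assumed here.
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, A, n, B, n, &beta, C, n);
  cudaDeviceSynchronize();

  for (int i = 0; i < 1000; ++i)  // timed region: the kernel_B loop
    kernel_B<<<1024, 256>>>();
  cudaDeviceSynchronize();

  cublasDestroy(handle);
  cudaFree(A); cudaFree(B); cudaFree(C);
  return 0;
}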

It may be a caching effect. I don’t know how you would “mitigate” that, based on what has been provided so far in this posting.
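One way to test the caching hypothesis (a sketch under the assumption that streaming writes through a sufficiently large scratch buffer will evict the previous kernel's lines from L2; the helper name and sizing factor are made up here):

#include <cuda_runtime.h>

// Hypothetical helper: overwrite a scratch buffer several times larger
// than the L2 so that data left behind by a previous kernel is evicted.
void flushL2() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  size_t bytes = 4 * (size_t)prop.l2CacheSize;  // comfortably exceeds L2
  void* scratch = nullptr;
  cudaMalloc(&scratch, bytes);
  cudaMemset(scratch, 0, bytes);  // device-side memset streams through L2
  cudaDeviceSynchronize();
  cudaFree(scratch);
}

If calling such a flush between the first loop and the timed loop makes the numbers for all four benchmarks converge, L2 residency is the likely explanation.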

If you are able to profile with Nsight Compute, clocks can be locked even on consumer models by setting "Clock Control" to "Base".