Repeated CUDA kernel calls get slower, not faster

I am familiar with the “warmup” thing, where the first kernel run is sometimes slower than subsequent runs. I have the opposite situation. The first few runs are very fast. Later runs are slower, until it quickly reaches a steady state. The launch configurations are always the same, and the input is always the same. Any idea what’s going on?

Kernel run times:

  1. 0.74 ms
  2. 0.94 ms
  3. 0.85 ms
  4. 3.39 ms
  5. 5.41 ms
  6. 5.76 ms
  7. 5.62 ms
  8. 5.61 ms
  …

Additional runs all seem to be in the 5.5–5.8 ms range.

The code is of the following form:

for (int run = 0; run < kNumRuns; run++) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start, 0);

  for (int channel = 0; channel < kNumChannels; channel++) {
    kernel_call_1(input_d, output_d);
    kernel_call_2(output_d, output_d);  // in-place FFT
  }
  cudaDeviceSynchronize();

  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);

  float elapsed_time;
  cudaEventElapsedTime(&elapsed_time, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);

  output_h = output_d;  // copy results back to host
  Verify_output();
}

I didn’t always have the cudaDeviceSynchronize call. I thought it was unnecessary since everything runs on the default stream, but an NVIDIA doc said it should be used, so I threw it in just in case. It didn’t make a difference.

Any ideas as to why the timing is the way it is?

It could be a thermal issue: the GPU might be throttling its clocks to stay below a temperature limit. To verify, monitor the GPU clocks and temperature while the program runs, for example with nvidia-smi.
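As a concrete example, something like the following polls the SM clock, temperature, and the software thermal-throttle flag once per second (a sketch; the exact query fields supported depend on your driver version, so check `nvidia-smi --help-query-gpu` first):

```shell
# Poll clocks, temperature, and thermal-throttle status every second
# while the CUDA program runs in another terminal. If the SM clock
# drops as the temperature climbs, the GPU is throttling.
nvidia-smi \
  --query-gpu=timestamp,clocks.sm,temperature.gpu,clocks_throttle_reasons.sw_thermal_slowdown \
  --format=csv -l 1
```

Watching the output alongside the kernel timings should show whether the slowdown at run 4 lines up with a clock drop.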

cudaDeviceSynchronize should not be necessary at that place. The stop event is recorded in the default stream, so cudaEventSynchronize will synchronize with the default stream as well, including all the kernels launched before the event was recorded.
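A minimal sketch of that timing pattern without the device-wide sync (assuming everything is launched in the default stream; kNumChannels and the kernel wrappers are placeholders from the question above):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);        // enqueue start marker in the default stream

for (int channel = 0; channel < kNumChannels; channel++) {
  kernel_call_1(input_d, output_d);
  kernel_call_2(output_d, output_d);
}

cudaEventRecord(stop, 0);         // enqueue stop marker after the kernels
cudaEventSynchronize(stop);       // blocks until all work before stop has finished

float elapsed_time;
cudaEventElapsedTime(&elapsed_time, start, stop);  // milliseconds between events

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

As a side note, the events could also be created once before the outer loop and destroyed once after it, rather than being created and destroyed on every iteration.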

You were right, it was throttling. Thank you.