Repeated CUDA kernel calls get slower, not faster

I am familiar with the “warmup” thing, where the first kernel run is sometimes slower than subsequent runs. I have the opposite situation. The first few runs are very fast. Later runs are slower, until it quickly reaches a steady state. The launch configurations are always the same, and the input is always the same. Any idea what’s going on?

Kernel run times:

  1. 0.74 ms
  2. 0.94 ms
  3. 0.85 ms
  4. 3.39 ms
  5. 5.41 ms
  6. 5.76 ms
  7. 5.62 ms
  8. 5.61 ms
  …

Additional runs all seem to be in the 5.5–5.8 ms range.

The code is of the following form:

for (int run = 0; run < kNumRuns; run++) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start, 0);

  for (int channel = 0; channel < kNumChannels; channel++) {
    kernel_call_1(input_d, output_d);
    kernel_call_2(output_d, output_d);  // in-place FFT
  }
  cudaDeviceSynchronize();

  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);

  float elapsed_time;
  cudaEventElapsedTime(&elapsed_time, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);

  output_h = output_d;  // copy results back to host
  Verify_output();
}

I didn’t always have the cudaDeviceSynchronize call. I thought it was unnecessary since everything runs on the default stream, but an NVIDIA doc said it should be used, so I threw it in just in case. It didn’t make a difference.

Any ideas as to why the timing is the way it is?

It could be a thermal issue: the GPU might be throttling its clocks to stay below a temperature limit. To verify, monitor the GPU clocks and temperature while the program runs, for example with nvidia-smi.
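As a concrete example, something like the following polls the SM clock, temperature, and the software thermal-throttle flag once per second (a sketch; the exact query fields supported depend on your driver version, so check `nvidia-smi --help-query-gpu` first):

```shell
# Poll clocks, temperature, and thermal-throttle status every second
# while the CUDA program runs in another terminal. If the SM clock
# drops as the temperature climbs, the GPU is throttling.
nvidia-smi \
  --query-gpu=timestamp,clocks.sm,temperature.gpu,clocks_throttle_reasons.sw_thermal_slowdown \
  --format=csv -l 1
```

Watching the output alongside the kernel timings should show whether the slowdown at run 4 lines up with a clock drop.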

cudaDeviceSynchronize should not be necessary at that place. The stop event is recorded in the default stream, so cudaEventSynchronize will synchronize with the default stream as well, including all the kernels launched before the event was recorded.
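A minimal sketch of that timing pattern without the device-wide sync (assuming everything is launched in the default stream; kNumChannels and the kernel wrappers are placeholders from the question above):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);        // enqueue start marker in the default stream

for (int channel = 0; channel < kNumChannels; channel++) {
  kernel_call_1(input_d, output_d);
  kernel_call_2(output_d, output_d);
}

cudaEventRecord(stop, 0);         // enqueue stop marker after the kernels
cudaEventSynchronize(stop);       // blocks until all work before stop has finished

float elapsed_time;
cudaEventElapsedTime(&elapsed_time, start, stop);  // milliseconds between events

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

As a side note, the events could also be created once before the outer loop and destroyed once after it, rather than being created and destroyed on every iteration.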

You were right, it was throttling. Thank you.