I am familiar with the “warmup” thing, where the first kernel run is sometimes slower than subsequent runs. I have the opposite situation. The first few runs are very fast. Later runs are slower, until it quickly reaches a steady state. The launch configurations are always the same, and the input is always the same. Any idea what’s going on?
Kernel run times:
- 0.74 ms
- 0.94 ms
- 0.85 ms
- 3.39 ms
- 5.41 ms
- 5.76 ms
- 5.62 ms
- 5.61 ms …
Additional runs all seem to be in the 5.5-5.8 ms range.
The code is of the following form:
for loop {
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    for (int channel = 0; channel < kNumChannels; channel++) {
        kernel_call_1(input_d, output_d);
        kernel_call_2(output_d, output_d); // In-place FFT
    }
    cudaDeviceSynchronize();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsed_time;
    cudaEventElapsedTime(&elapsed_time, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    output_h = output_d;
    Verify_output();
}
I didn’t always have the cudaDeviceSynchronize call. I thought I didn’t need it since everything runs on the default stream, but an NVIDIA doc said it should be used, so I added it just in case. It didn’t make a difference.
Any ideas as to why the timing is the way it is?
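For anyone who wants to try reproducing this, here is a minimal, self-contained sketch of the same timing pattern. It is not the actual code: dummy_kernel_1 and dummy_kernel_2, the CHECK macro, time_iterations, and all sizes and iteration counts are hypothetical placeholders chosen for illustration. Two deliberate differences from the loop above: the events are created once outside the loop, and there is no cudaDeviceSynchronize before the stop event is recorded, on the assumption that cudaEventSynchronize(stop) is enough to wait for the preceding work on the default stream.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Placeholder for kernel_call_1: writes a scaled copy of the input.
__global__ void dummy_kernel_1(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Placeholder for kernel_call_2: an in-place update (not a real FFT).
__global__ void dummy_kernel_2(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] + 1.0f;
}

// Runs `iterations` timed passes and returns the elapsed time of each in ms.
std::vector<float> time_iterations(int iterations, int num_channels, int n) {
    std::vector<float> times_ms;
    std::vector<float> host(n, 1.0f);
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float* input_d = nullptr;
    float* output_d = nullptr;
    CHECK(cudaMalloc(&input_d, bytes));
    CHECK(cudaMalloc(&output_d, bytes));
    CHECK(cudaMemcpy(input_d, host.data(), bytes, cudaMemcpyHostToDevice));

    // Events are created once and reused for every iteration.
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    for (int it = 0; it < iterations; ++it) {
        CHECK(cudaEventRecord(start, 0));
        for (int channel = 0; channel < num_channels; ++channel) {
            dummy_kernel_1<<<blocks, threads>>>(input_d, output_d, n);
            dummy_kernel_2<<<blocks, threads>>>(output_d, n);
        }
        CHECK(cudaGetLastError());         // catch launch errors
        CHECK(cudaEventRecord(stop, 0));
        CHECK(cudaEventSynchronize(stop)); // wait until stop has been reached

        float elapsed_ms = 0.0f;
        CHECK(cudaEventElapsedTime(&elapsed_ms, start, stop));
        times_ms.push_back(elapsed_ms);

        // Copy the result back, standing in for output_h = output_d.
        CHECK(cudaMemcpy(host.data(), output_d, bytes, cudaMemcpyDeviceToHost));
    }

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(input_d));
    CHECK(cudaFree(output_d));
    return times_ms;
}

int main() {
    // Arbitrary illustrative sizes: 20 timed passes, 8 channels, 1M floats.
    std::vector<float> times = time_iterations(20, 8, 1 << 20);
    for (size_t i = 0; i < times.size(); ++i) {
        std::printf("run %zu: %.3f ms\n", i, times[i]);
    }
    return 0;
}
```

Printing one time per pass makes it easy to see whether the fast-then-slow pattern also shows up with trivial kernels, or only with the real workload.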