Warm-up kernel and time measurement

I have a program like this:
main()
{
    call fun1;
    call fun2;
}

fun1()
{
    cudaSetDevice
    cudaDeviceSynchronize
    cudaDeviceReset

    cudaEventRecord(start)
    call kernel1
    cudaGetLastError
    cudaDeviceSynchronize
    cudaEventRecord(stop)
    cudaEventSynchronize(stop)

    cudaDeviceReset
}

fun2()
{
    cudaSetDevice
    cudaDeviceSynchronize
    cudaDeviceReset

    cudaEventRecord(start)
    call kernel2
    cudaGetLastError
    cudaDeviceSynchronize
    cudaEventRecord(stop)
    cudaEventSynchronize(stop)

    cudaDeviceReset
}
Is it necessary to have a warmup kernel, i.e. to call the same kernel twice?
I get different execution times when I run the code several times.
For example:
in the first run:
kernel1 1200 ms
kernel2 600 ms

in the second run:
kernel1 2200 ms
kernel2 600 ms

or, in the first run:
kernel1 1300 ms
kernel2 700 ms

or any other combination.
Sometimes the first run of kernel1 differs a lot from the second run of kernel1.
Sometimes the difference between kernel1 and kernel2 in the first run differs a lot from that in the second run.

  1. I call cudaDeviceReset.
    Is this enough to replace a warmup kernel?
  2. In case a warmup kernel is needed, should it be in both functions?
  3. I have a GTX 960 and it also drives the display (it works as the VGA card) at the same time.
    I know it has some other work to do when I run the program,
    but when I run it I don’t do anything else myself.
    How can the deviation in the times be explained?

Thanks to all.

I don’t understand why you use cudaDeviceReset(). It does not do any warm-up.

Runtime can vary when there is other work being performed on the GPU, for example display output, or if the exact same clock speeds are not used.

You don’t need cudaDeviceSynchronize before cudaEventRecord(stop); the cudaEventSynchronize(stop) that follows already waits for the stop event to complete.
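i.e. the timing sequence only needs something like this (a sketch; the kernel name, its arguments, and the launch configuration are placeholders for your own code):

cudaEventRecord(start);
kernel1<<<grid, block>>>(args);
cudaEventRecord(stop);           // no cudaDeviceSynchronize needed before this
cudaEventSynchronize(stop);      // waits until the stop event has completed
cudaEventElapsedTime(&ms, start, stop);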

At the beginning I use cudaDeviceReset to be sure the GPU starts from a completely fresh state (all variables cleared, nothing left over from a previous run, etc.).
At the end I use it to clear all variables etc.

I know that the cudaDeviceSynchronize before cudaEventRecord(stop) is not necessary.
The statement was left over from before I added the timing statements.
Does it cause an error in the measurements?

If I had a second GPU with no display output connected to it, could I get more stable measurements?

What do you mean by “if not the same exact clock speeds are used”?
Do you mean the GPU clock?
Where can I check this, and how can I change it?

With cudaDeviceSynchronize before cudaEventRecord, the time will slightly increase because of the added latency.

nvidia-smi can display and control clock frequency. This is just some brainstorming. For time measurements, I usually launch a kernel multiple times in a for-loop, say 100 times, and measure the total runtime / average runtime for comparison between kernels. I don’t worry about the frequencies.
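Something along these lines, as a self-contained sketch (myKernel, the data size, and the launch configuration here are just placeholders, not your code):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;      // arbitrary placeholder work
}

int main()
{
    const int n = 1 << 20;
    const int reps = 100;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // one untimed launch as a warm-up
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float totalMs = 0.0f;
    cudaEventElapsedTime(&totalMs, start, stop);
    printf("total: %f ms, average per launch: %f ms\n", totalMs, totalMs / reps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}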

It is impossible to reason about a kernel which you have not shown.

Thanks.
I’ll try it and let you know.

I tried the above method and got reasonable results.
But I noticed that I get different times for HtD and DtH.

cudaEventRecord (startHtD)
cudaMalloc HtD
cudaEventRecord (stopHtD)

… kernel process …

cudaEventRecord (startDtH)
cudaMallocDtH
cudaEventRecord (stopDth)

While the HtD time is measurable (~30 ms), the DtH time is 0.0000 ms.
The array that is transferred is the same in both cases (in the second case it has been processed by the kernel).
Is the DtH transfer faster than the HtD transfer to the point where it cannot be measured?

Please show the exact code. I don’t know what you mean by cudaMallocDtH.

cudaEventRecord (startHtD);
cudaMemcpy (dev_imageCnv, imageCnv, imageCnv_lin * imageCnv_col * channels * sizeof (uint8_t), cudaMemcpyHostToDevice);
cudaEventRecord (stopHtD);

cudaEventRecord (startDtH);
cudaMemcpy (imageCnv, dev_imageCnv, imageCnv_lin * imageCnv_col * channels * sizeof (uint8_t), cudaMemcpyDeviceToHost);
cudaEventRecord (stopDtH);

The usage is correct.
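For reference, reading the numbers out would then look something like this (a sketch that simply continues your snippet; the synchronize calls make sure each stop event has completed before its elapsed time is read):

cudaEventSynchronize(stopHtD);
cudaEventSynchronize(stopDtH);

float msHtD = 0.0f, msDtH = 0.0f;
cudaEventElapsedTime(&msHtD, startHtD, stopHtD);
cudaEventElapsedTime(&msDtH, startDtH, stopDtH);
printf("HtD: %f ms, DtH: %f ms\n", msHtD, msDtH);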

I’ll check it with the profiler and let you know.

I followed your advice and ran the code with 100 repetitions.
I repeated this process 5 times to compare the trend of the times.
I noticed that in the first 3 runs the 100 timings have an increasing tendency, while in the last 2 they remain almost constant.
I imagine it was a coincidence that the stability happened to occur in the last 2 runs.
But I would have expected the opposite of this increasing trend:
logically, the first timings should be higher and then either stabilize or slowly decrease.
Specifically, the first timing is around 1728 ms and the last around 1733 ms (or 1731 ms to 1734 ms),
while the runs that remain almost constant hover around 1734 ms.
The averages are 1732, 1732, 1733, 1734 and 1734 ms.
Obviously not a measurable difference.

I just wanted to know whether I should run the code 100 times, 50, or fewer.
In the last two runs, for example, I would have got the same results with fewer than 100 repetitions.

Also, if the warm-up theory is valid, shouldn’t the times decrease as the repetitions progress, since by then the GPU has woken up from idle?