GPU computing time problem: cutGetTimerValue bug?

Hello everyone, I would like to share a puzzling problem with you. I am in the process of rewriting a large CFD code from C/C++ to C/CUDA, and I have completed an important step by getting part of the solver to run on the GPU. I am currently testing on a Fermi C2050. Besides the quality of the results, which is excellent when compared to the output of the serial CPU code, the most important aspect is the overall speed-up achieved, so I ran a series of tests in which I invoke the same application 100 times to check whether the results are always the same. They are, as far as the code's relevant output is concerned, but to my surprise I noticed a large scatter in the GPU computing times measured inside the CUDA code. The structure of the code is as follows:

void CUDA_CODE(…)
{
    // allocate GPU memory, create the CUDA grid/blocks

    // timer for the kernel section (cutil timer)
    unsigned int timer_kernel = 0;

    cudaThreadSynchronize();
    cutCreateTimer(&timer_kernel);
    cutStartTimer(timer_kernel);

    // invoke the GPU kernels and solve the iteration loop N times

    cudaThreadSynchronize();
    cutStopTimer(timer_kernel);
    ctime_kernel = cutGetTimerValue(timer_kernel);  // elapsed time in ms

    return;
}

The GPU times I then get can be subdivided into three groups:

- runs that took 118 s
- runs that took 130 s
- runs that took 149 s

Regardless of how many instances of the same application I launch, I always obtain the same results in terms of the CUDA code's output, so apparently there is no programming error in the algorithm as designed in CUDA (at least I believe so). But could somebody please explain to me what is going on with the GPU times?

Many thanks

Regards

Marco

P.S. The overall speed-up compared to the serial CPU version of the code is excellent: about two orders of magnitude.

If you are on Windows and not using the TCC driver, you might be a victim of the batching that the driver does to work around the high kernel invocation times on that platform.

Placing a few cudaStreamQuery(0) calls in strategic places to flush the queue might help your kernels run at a consistent speed. Recording an event and querying it is another technique that achieves the same effect; see the sketch below.
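As a rough illustration of both ideas, here is a minimal sketch, assuming the kernels are launched into the default stream. The kernel iterate_kernel and the wrapper run_iterations are placeholders invented for this example, not part of Marco's code:

#include <cuda_runtime.h>

// Dummy kernel standing in for one solver iteration.
__global__ void iterate_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

void run_iterations(float *d_data, int n, int N)
{
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    cudaEvent_t flush;
    cudaEventCreateWithFlags(&flush, cudaEventDisableTiming);

    for (int it = 0; it < N; ++it)
    {
        iterate_kernel<<<grid, block>>>(d_data, n);

        // Option 1: querying the default stream makes the WDDM driver submit
        // the batched launches to the GPU without blocking the CPU.
        cudaStreamQuery(0);

        // Option 2 (alternative): record an event and query it; this also
        // pushes the pending work to the device.
        // cudaEventRecord(flush, 0);
        // cudaEventQuery(flush);
    }

    cudaThreadSynchronize();   // same synchronization point used around the timer
    cudaEventDestroy(flush);
}

Note that cudaStreamQuery() and cudaEventQuery() return immediately rather than waiting for the work to finish, so they do not introduce extra synchronization points; the cost is only one driver call per iteration.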