Where does the kernel time overhead come from? (Fermi, kernel time, profiling)

I am using the CUDA SDK 3.2 on Windows 7. Suppose that, through profiling, we know the execution time of a kernel is A. If we run this kernel N times back to back (see the code at the bottom for details), the expected total execution time should be N*A. We also use a host timer to measure the actual execution time of the N runs. On a Quadro FX4800, the expected and measured times match well; any deviation is below 5%. On a Fermi GTX580, however, the deviation is as large as 50%, and this holds for most of the kernels I have tested: running a kernel N times actually takes about 1.5*N*A instead of the expected N*A (the measured time is roughly 50% longer).

Compared with older GPGPU architectures, Fermi is very new. From the Fermi documentation, we know that different kernels from the same context can execute concurrently. Because of this new feature, some deviation would be reasonable. But if that were the cause, the actual execution time should be smaller, something like N*A/1.5, not larger, like N*A*1.5.

I find this pretty confusing. Can anyone help me out? I would appreciate your opinions.

##############################################################
- For a single run, it is as follows:
clEnqueueNDRangeKernel();
clFinish();

- For N runs, it is as follows:
StartTiming
for (int i = 0; i < N; i++) {
    clEnqueueNDRangeKernel();
}
clFinish();
StopTiming
##############################################################
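For what it's worth, here is a minimal sketch of how the overhead could be localized with OpenCL event profiling: it separates the per-launch device execution time (START to END) from the queue-to-start gap where launch overhead would show up. It assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE, and `queue`, `kernel`, and `globalSize` stand in for the existing setup.

#include <CL/cl.h>
#include <cstdio>

// Sketch: accumulate per-launch timestamps via OpenCL event profiling.
// `queue` must be created with CL_QUEUE_PROFILING_ENABLE; `kernel` and
// `globalSize` are assumed to be set up elsewhere.
void profileLaunches(cl_command_queue queue, cl_kernel kernel,
                     size_t globalSize, int N)
{
    cl_ulong execNs = 0, gapNs = 0;

    for (int i = 0; i < N; ++i) {
        cl_event evt;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, &evt);
        clWaitForEvents(1, &evt);

        cl_ulong queued, start, end;
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,  sizeof(start),  &start,  NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,    sizeof(end),    &end,    NULL);

        execNs += end - start;    // time the kernel actually ran on the device
        gapNs  += start - queued; // queueing/launch overhead before it started
        clReleaseEvent(evt);
    }
    printf("device execution: %.3f ms, launch overhead: %.3f ms\n",
           execNs * 1e-6, gapNs * 1e-6);
}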

What happens if you move the clFinish() into the for loop? It would also be interesting to see the individual timings of each iteration.
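Something like the following rough sketch (`queue`, `kernel`, and `globalSize` are placeholders for whatever you already have set up):

#include <CL/cl.h>
#include <chrono>
#include <cstdio>

// Sketch: time each launch separately, with clFinish() inside the loop.
void timeEachIteration(cl_command_queue queue, cl_kernel kernel,
                       size_t globalSize, int N)
{
    for (int i = 0; i < N; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
        clFinish(queue);   // wait for this launch before stopping the timer
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("iteration %d: %.3f ms\n", i, ms);
    }
}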

Hi,

I did some tests based on your suggestion, under Linux. Through profiling, the execution time of this kernel is 0.56 ms. The measured totals (in ms) are as follows:

Number of runs:                 20        200      2000     20000
clFinish inside the for loop:   39.167    158      1315     13085
clFinish outside the loop:      31.962    134      1156     11385

Hi!

Isn’t it only the numbers in the 20-run column that are way off? (20 × 0.56 ms is about 11 ms expected, versus the 32–39 ms measured.) The others seem OK to me.

What if you run two different kernels, i.e.

for (int i = 0; i < N; i++)
{
    clEnqueueNDRangeKernel(KernelA);
    clFinish();
    clEnqueueNDRangeKernel(KernelB);
    clFinish();
}

Do you get N*(A+B), where A and B are the single execution times of kernels A and B respectively?
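A rough sketch of that experiment (`queue`, `kernelA`, `kernelB`, and `globalSize` are assumed to be created elsewhere); it accumulates the time spent in each kernel separately, so the totals can be checked against N*A and N*B individually:

#include <CL/cl.h>
#include <chrono>
#include <cstdio>

// Sketch: alternate two kernels and accumulate their times separately.
void timeTwoKernels(cl_command_queue queue, cl_kernel kernelA, cl_kernel kernelB,
                    size_t globalSize, int N)
{
    using clock = std::chrono::steady_clock;
    double msA = 0.0, msB = 0.0;

    for (int i = 0; i < N; ++i) {
        auto t0 = clock::now();
        clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
        clFinish(queue);
        auto t1 = clock::now();
        clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
        clFinish(queue);
        auto t2 = clock::now();

        msA += std::chrono::duration<double, std::milli>(t1 - t0).count();
        msB += std::chrono::duration<double, std::milli>(t2 - t1).count();
    }
    printf("kernel A total: %.3f ms, kernel B total: %.3f ms\n", msA, msB);
}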