I am using CUDA SDK 3.2 on Windows 7. Suppose that, through profiling, we know the execution time of a kernel is A. If we then run this kernel N times back to back (for details, see the pseudocode at the bottom), the expected total execution time should be NA. We also use a timer to measure the actual execution time of the N runs. On a Quadro FX4800, the expected time matches the measured time well; any deviation is smaller than 5%. On a Fermi GTX580, however, the deviation is as large as 50%, and this holds for most of the kernels I have tested. The time for running the kernel N times is actually about 1.5NA instead of the expected NA, i.e. the actual execution time is about 50% longer.
Compared to older GPGPUs, Fermi is a very new architecture. From Fermi's documentation, we know that different kernels from the same context can execute concurrently. Because of this new feature, some deviation would be reasonable. But if that were the cause, the actual execution time should be smaller, e.g. NA/1.5, not larger, e.g. NA*1.5.
This is pretty confusing. Can anyone help me out? I'd appreciate your opinions.
##############################################################
–For a single run, it is as follows:
clEnqueueNDRangeKernel();
clFinish();
–For N runs, it is as follows:
StartTiming
for (int i = 0; i < N; i++) {
    clEnqueueNDRangeKernel();
}
clFinish();
StopTiming
##############################################################