Hello,
I have written a while loop which, on each iteration, executes:
- a kernel, and
- a host function which performs exactly the same operation as the kernel.
The objective is to measure the time that each function (host and device) takes. I have noticed that the first execution of the kernel (the first loop iteration) is much faster than the others: about 0.088 ms against about 0.268 ms, roughly three times faster. See the results below:
Type N: 1000
Type numThreadPerBlock (<= 512): 512
elapsedTimeCPU = 1.875000 miliseconds
elapsedTimeGPU = 0.088000 miliseconds
factor = 21.306818 (elapsedTimeCPU/elapsedTimeGPU)
Type N: 1000
Type numThreadPerBlock (<= 512): 512
elapsedTimeCPU = 1.848000 miliseconds
elapsedTimeGPU = 0.267000 miliseconds
factor = 6.921349 (elapsedTimeCPU/elapsedTimeGPU)
Type N: 1000
Type numThreadPerBlock (<= 512): 512
elapsedTimeCPU = 1.847000 miliseconds
elapsedTimeGPU = 0.268000 miliseconds
factor = 6.891791 (elapsedTimeCPU/elapsedTimeGPU)
Type N: 1000
Type numThreadPerBlock (<= 512): 512
elapsedTimeCPU = 1.847000 miliseconds
elapsedTimeGPU = 0.268000 miliseconds
factor = 6.891791 (elapsedTimeCPU/elapsedTimeGPU)
Type N: 1000
Type numThreadPerBlock (<= 512): 512
elapsedTimeCPU = 1.862000 miliseconds
elapsedTimeGPU = 0.269000 miliseconds
factor = 6.921933 (elapsedTimeCPU/elapsedTimeGPU)
Type N: 1000
Type numThreadPerBlock (<= 512): 512
elapsedTimeCPU = 1.850000 miliseconds
elapsedTimeGPU = 0.269000 miliseconds
factor = 6.877324 (elapsedTimeCPU/elapsedTimeGPU)
I have already checked for errors in the first execution, but I found none. I am using the timer functions of the cutil library, and I call cutilSafeThreadSync() before starting and before stopping the timer.
Has anyone already noticed this, or can anyone try to reproduce it with a simple kernel?
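For anyone who wants to try, a reproduction along these lines could look like the sketch below. It is only a minimal sketch, not my actual code: the kernel (saxpyKernel) and the host function (saxpyHost) are hypothetical stand-ins, the loop runs a fixed 6 iterations with N = 1000 and 512 threads per block instead of reading the values from the keyboard, and it times the device with CUDA events and the host with std::chrono instead of the cutil timers.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the real kernel: y[i] = a * x[i] + y[i].
__global__ void saxpyKernel(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Host function performing the same operation as the kernel.
void saxpyHost(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1000;                 // same N as in the log above
    const int threadsPerBlock = 512;    // same block size as in the log above
    const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    const float a = 2.0f;

    std::vector<float> x(n, 1.0f), y(n, 0.0f);
    float* dX = nullptr;
    float* dY = nullptr;
    CHECK(cudaMalloc(&dX, n * sizeof(float)));
    CHECK(cudaMalloc(&dY, n * sizeof(float)));
    CHECK(cudaMemcpy(dX, x.data(), n * sizeof(float), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(dY, y.data(), n * sizeof(float), cudaMemcpyHostToDevice));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    for (int iter = 0; iter < 6; ++iter) {
        // Time the host version with a monotonic wall clock.
        auto t0 = std::chrono::steady_clock::now();
        saxpyHost(n, a, x.data(), y.data());
        auto t1 = std::chrono::steady_clock::now();
        float cpuMs = std::chrono::duration<float, std::milli>(t1 - t0).count();

        // Time the kernel with events recorded around the launch.
        CHECK(cudaEventRecord(start));
        saxpyKernel<<<numBlocks, threadsPerBlock>>>(n, a, dX, dY);
        CHECK(cudaGetLastError());          // catches launch errors
        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));  // wait until the kernel has finished
        float gpuMs = 0.0f;
        CHECK(cudaEventElapsedTime(&gpuMs, start, stop));

        // Printing y[0] keeps the compiler from discarding the host work.
        printf("iter %d: elapsedTimeCPU = %f ms, elapsedTimeGPU = %f ms, y[0] = %f\n",
               iter, cpuMs, gpuMs, y[0]);
    }

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(dX));
    CHECK(cudaFree(dY));
    return 0;
}
```

Because the events are recorded on the same stream as the kernel, the GPU time here does not depend on where the host-side synchronization calls are placed. If the first iteration still stands out with this version, that would suggest the effect does not come from the cutil timers.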