Hi
I was trying to calculate the execution time of a kernel and noticed a weird thing.
If I run the kernel once before starting the timer (the one available with cutil), the measured execution time is in fact lower.
For example:
kernel_MotionEst<<< grid, threads >>>(Target, Source, 0, 0); // running a kernel like this in advance
for (int i = 0; i < 10; i++)
{
    cutilCheckError(cutStartTimer(timer1));
    kernel_MotionEst<<< grid, threads >>>(Target, Source, 0, 0);
    cutilCheckError(cutStopTimer(timer1));
}
I don't really know why this is happening, but I would like to know whether there is an architectural reason for it, for example that calling the kernel in advance populates a cache and so causes fewer misses later, or something similar.
I understand that running a kernel in advance actually increases the overall execution time of the program, but I am still interested in whether there is an architectural reason for the lower timed result.
I tried this with two different simple kernels and it holds true for both. I am using a 2.3 GHz Core 2 with 4 MB of L2 cache and an NVIDIA Tesla C1060.
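
To make the comparison easier to reproduce, here is a minimal sketch of how the first launch could be timed separately from the later ones. It uses CUDA events instead of the cutil timer, which is a substitution and not what the loop above does. Kernel launches are asynchronous with respect to the host, so the sketch waits on the stop event before reading the elapsed time. dummyKernel and timeLaunch are hypothetical stand-ins (kernel_MotionEst itself is not shown here), and the problem size and block size are arbitrary assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for kernel_MotionEst; the real kernel is not shown.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}

// Hypothetical helper: times a single launch with CUDA events and
// returns the elapsed time in milliseconds.
static float timeLaunch(float *d_data, int n, dim3 grid, dim3 threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    dummyKernel<<<grid, threads>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // block until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 20;       // assumed problem size
    float *d_data = 0;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    dim3 threads(256);           // assumed block size
    dim3 grid((n + threads.x - 1) / threads.x);

    // The first launch is reported on its own.
    float first = timeLaunch(d_data, n, grid, threads);
    printf("first launch: %.3f ms\n", first);

    // Later launches are averaged, mirroring the 10-iteration loop above.
    const int runs = 10;
    float total = 0.0f;
    for (int i = 0; i < runs; i++)
        total += timeLaunch(d_data, n, grid, threads);
    printf("average of next %d launches: %.3f ms\n", runs, total / runs);

    cudaFree(d_data);
    return 0;
}
```

Printing the first launch separately from the average of the later ones should show whether the difference comes only from the very first call or persists across iterations.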