I was trying to measure the execution time of a kernel and noticed something odd.
If I run the kernel once before starting the timer (the one available with cutil), the measured execution time is in fact lower:
// run the kernel once in advance, before the timer is started
kernel_MotionEst<<< grid, threads >>>(Target, Source, 0, 0);
// the timer is started here; these are the launches being timed
for (int i = 0; i < 10; i++)
    kernel_MotionEst<<< grid, threads >>>(Target, Source, 0, 0);
I don't really know why this is happening, but I would like to know whether there is an architectural explanation, for example that calling the kernel in advance populates a cache and so causes fewer misses.
I understand that running a kernel in advance increases the overall execution time of the program, but I am still interested in whether there are architectural reasons for the lower measured time.
I tried this with two different simple kernels and it holds for both. I am using a Core 2 at 2.3 GHz with 4 MB of L2 cache and an NVIDIA Tesla C1060.
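
For anyone who wants to reproduce the comparison, here is a minimal sketch of the measurement. It is not my actual code: dummy_kernel is a hypothetical stand-in for kernel_MotionEst, time_launches_ms is a helper name made up for this example, and it uses CUDA events from the runtime API rather than the cutil timer. It times the same batch of 10 launches twice, so the first batch has no prior launch and the second one does.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for kernel_MotionEst: a trivial element-wise update.
__global__ void dummy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}

// Times `reps` back-to-back launches with CUDA events.
// Returns elapsed milliseconds, or -1.0f if a CUDA call failed.
float time_launches_ms(float *d_data, int n, int reps)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess)
        return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; i++)
        dummy_kernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop, 0);
    // Kernel launches are asynchronous, so wait for the stop event
    // before reading the elapsed time.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);
    if (err == cudaSuccess)
        err = cudaGetLastError();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (err == cudaSuccess) ? ms : -1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // First batch: the kernel has never run, so any one-time cost of the
    // first launch is included in this measurement.
    float cold = time_launches_ms(d_data, n, 10);
    // Second batch: the kernel has already run before the timer starts.
    float warm = time_launches_ms(d_data, n, 10);

    printf("10 launches, no prior launch:   %f ms\n", cold);
    printf("10 launches, after a prior run: %f ms\n", warm);

    cudaFree(d_data);
    return 0;
}
```

If the second number comes out lower than the first, that matches what I am seeing. The sketch only shows the difference and does not by itself say where it comes from.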