Speed-up due to a prior kernel launch?


I was trying to calculate the execution time of a kernel and noticed a weird thing.

If I run a kernel before starting the timer (the one available with cutil), the measured execution time was in fact lower. For example:

kernel_MotionEst<<< grid, threads >>>(Target, Source, 0, 0); // running a kernel like this in advance
for (int i = 0; i < 10; i++)
    kernel_MotionEst<<< grid, threads >>>(Target, Source, 0, 0); // the launches being timed

I don't really know why this is happening, but I would certainly like to know whether there are architectural reasons, for example whether calling a kernel in advance populates the cache and causes fewer misses…

I understand that running a kernel in advance actually increases the overall execution time of the program, but I am still interested in whether there are any architectural reasons for the effect.

I tried this with two different simple kernels and it holds true in both cases. I am using a Core 2 at 2.3 GHz with 4 MB of L2 cache and an NVIDIA Tesla C1060.


  1. There is an initialization overhead that you incur on your very first CUDA call, so your first kernel launch is roughly an order of magnitude slower than all subsequent ones. Check the CUDA SDK for examples of this. I generally run a dummy kernel with a minimal amount of data to initialize the GPU.

  2. The way you are timing does not seem correct, as you have no cudaThreadSynchronize() call after your kernel launch. Kernel launches are non-blocking: control returns to the CPU thread immediately once your kernel is launched. Hence, put a cudaThreadSynchronize() after each kernel if you want to time it accurately. For the exact details, see the programming guide…
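Putting both points together, the timing pattern could look roughly like this (a sketch, not tested code; the kernel name and arguments are taken from the original post, and the cutil timer calls are from the old CUDA SDK utilities):

```cuda
// Sketch: warm-up launch plus synchronized timing around the loop.
unsigned int timer = 0;
cutCreateTimer(&timer);

// Warm-up launch so the one-time CUDA initialization cost is paid
// before the timed region begins.
kernel_MotionEst<<< grid, threads >>>(Target, Source, 0, 0);
cudaThreadSynchronize();

cutStartTimer(timer);
for (int i = 0; i < 10; i++) {
    kernel_MotionEst<<< grid, threads >>>(Target, Source, 0, 0);
}
// Kernel launches are asynchronous: block until all of them have
// finished before stopping the timer.
cudaThreadSynchronize();
cutStopTimer(timer);

printf("Average time per launch: %f ms\n", cutGetTimerValue(timer) / 10.0f);
cutDeleteTimer(timer);
```

Without the second cudaThreadSynchronize() you would mostly be measuring launch overhead rather than kernel execution time.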

Hope this helps…

Use the CUDA event API, which provides calls to create and destroy events, record events, etc. This is the best way to time GPU code, and it is safer as well.
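For illustration, an event-based version of the same measurement might look like this (again a sketch; kernel name and arguments are from the original post, and cudaEventElapsedTime reports milliseconds):

```cuda
// Sketch: timing the kernel loop with CUDA events instead of a CPU timer.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

kernel_MotionEst<<< grid, threads >>>(Target, Source, 0, 0); // warm-up launch

cudaEventRecord(start, 0);
for (int i = 0; i < 10; i++)
    kernel_MotionEst<<< grid, threads >>>(Target, Source, 0, 0);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop); // wait until the stop event has actually occurred

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Average kernel time: %f ms\n", ms / 10.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because the events are recorded on the GPU's stream, this measures device-side time directly and does not depend on when the CPU thread regains control.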

I forgot to include the cudaThreadSynchronize() part when posting; I have used it… I was wondering: what is the exact reason for the kernel launch overhead?