Speed-up due to a kernel launch?

Saleel · December 23, 2009, 11:41pm

Hi

I was trying to measure the execution time of a kernel and noticed something weird.

If I run the kernel once before starting the timer (the one available in cutil), the measured execution time is in fact lower. For example:

kernel_MotionEst<<< grid, threads >>>(Target, Source, 0, 0); // running a kernel like this in advance
for (int i = 0; i < 10; i++)
{
    cutilCheckError(cutStartTimer(timer1));
    kernel_MotionEst<<< grid, threads >>>(Target, Source, 0, 0);
    cutilCheckError(cutStopTimer(timer1));
}

I don't really know why this is happening, but I would like to know whether there is an architectural reason for it, e.g. calling a kernel in advance populates a cache and so causes fewer misses, or something similar…

I understand that running a kernel in advance increases the overall execution time of the program, but I am still interested in whether there is an architectural reason for the effect.

I tried this with two different simple kernels and it holds for both. I am using a Core 2 (2.3 GHz) with 4 MB of L2 cache and an NVIDIA Tesla C1060.

nitin.life · December 24, 2009, 11:49pm

Two things:

1. There is an initialization overhead that you hit on your very first CUDA call, so your first kernel launch is roughly an order of magnitude slower than all subsequent launches. Check the CUDA SDK for examples of this. I generally run a dummy kernel call on a minimal amount of data to initialize the GPU.
2. The way you are timing does not look correct, because there is no cudaThreadSynchronize() call after your kernel launch. Kernel launches are non-blocking, so control returns to the CPU thread as soon as the kernel has been launched. Put a cudaThreadSynchronize() after each kernel launch if you want to time it properly. For exact details see the programming guide…
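The two points above can be sketched roughly as follows. This is only an illustration under assumptions: scaleKernel and timedLaunchMs are hypothetical stand-ins (the real kernel_MotionEst is not shown in this thread), a std::chrono host timer replaces the cutil timer, and cudaDeviceSynchronize() is used because it is the current replacement for the now-deprecated cudaThreadSynchronize().

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for kernel_MotionEst: scales each element in place.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Launches the kernel once, waits for it to finish, and returns the
// elapsed host wall-clock time in milliseconds.
static double timedLaunchMs(float *d_data, int n)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    auto start = std::chrono::steady_clock::now();
    scaleKernel<<<blocks, threads>>>(d_data, n, 1.0f);
    // The launch is asynchronous: without this call the timer would stop
    // before the kernel has actually finished.
    cudaDeviceSynchronize();
    auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // Warm-up launch: measured separately so that any one-time cost of the
    // first launch does not end up in the numbers below.
    std::printf("warm-up launch: %.3f ms\n", timedLaunchMs(d_data, n));

    // Timed launches, each followed by a synchronization.
    for (int i = 0; i < 10; ++i)
        std::printf("launch %2d     : %.3f ms\n", i + 1, timedLaunchMs(d_data, n));

    cudaFree(d_data);
    return 0;
}
```

Printing the warm-up time next to the later launches makes it easy to see on your own hardware how much of the difference comes from the first call.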

Hope this helps…

heshsham_India · December 25, 2009, 11:55am

Use the CUDA event API, which provides calls to create, destroy, and record events, among others. It is the best way to time kernels, and it is safer as well.
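A minimal sketch of event-based timing, assuming a hypothetical scaleKernel in place of the kernel from the first post and a hypothetical helper averageLaunchMs; the cudaEvent* calls are the standard runtime API, but error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being timed.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the average time per launch in milliseconds, measured on the GPU
// with CUDA events over `iters` back-to-back launches.
static float averageLaunchMs(float *d_data, int n, int iters)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Untimed warm-up launch, so first-call overhead is not measured.
    scaleKernel<<<blocks, threads>>>(d_data, n, 1.0f);
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);            // enqueue start marker
    for (int i = 0; i < iters; ++i)
        scaleKernel<<<blocks, threads>>>(d_data, n, 1.0f);
    cudaEventRecord(stop, 0);             // enqueue stop marker
    cudaEventSynchronize(stop);           // wait until the stop event has occurred

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    std::printf("average per launch: %.3f ms\n", averageLaunchMs(d_data, n, 10));

    cudaFree(d_data);
    return 0;
}
```

Because the events are recorded in the same stream as the kernels, the elapsed time covers the GPU-side work between the two markers and does not depend on when the host thread happens to read a CPU timer.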

Saleel · December 26, 2009, 8:22am

I forgot to include the cudaThreadSynchronize() call when posting the code here; I do use it. What I was wondering is: what exactly is the reason for the kernel launch overhead?
