Launch Overhead as a function of Kernel Size... Is it Proportional? Characterization?

I’m trying to measure the overhead of the first kernel launch as well as the average overhead of subsequent launches. I’m also trying to characterize these overheads, mainly by referring to this forum.

I also want to correlate the first kernel launch overhead with the size of the kernel (that is, the total number of instructions in the kernel code from the cubin file). I expect the first launch overhead to increase with kernel size, because it should take longer to transfer the kernel instructions from the CPU to the GPU.

The Programming Guide reports that the kernel size limit is about 2 million instructions.

To set up my experiment, I’ve used an inline device function as shown below:

__device__ int sum (int a, int b)
{
        return a+b;
}

__global__ void testKernel1(int a, int b, int *d_out)
{
        int c=sum(a,b);

        //Add more invocations here
}

I add more invocations of the device function to increase the instruction count, then launch the kernel about 100 times to collect timings. I used the NVIDIA Visual Profiler to get the time per call. I measure the startup overhead as the difference between the time of the first call and the time of the subsequent calls.
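For reference, the same measurement could be sketched in code instead of the profiler. This is only a rough sketch under a few assumptions: it times each launch with CUDA events, the helper startupOverheadMs is a name made up for illustration, and the kernel stores its result to d_out (which the kernel above does not do) so that the compiler has less reason to discard the calls to sum().

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__device__ int sum(int a, int b) { return a + b; }

// Assumption: the result is written to d_out so the calls to sum()
// have a visible side effect and are not removed as dead code.
__global__ void testKernel1(int a, int b, int *d_out)
{
    int c = sum(a, b);
    // Add more invocations here
    *d_out = c;
}

// Hypothetical helper: first launch time minus the mean of the later launches.
float startupOverheadMs(const std::vector<float> &ms)
{
    if (ms.size() < 2) return 0.0f;
    float rest = 0.0f;
    for (size_t i = 1; i < ms.size(); ++i) rest += ms[i];
    return ms[0] - rest / static_cast<float>(ms.size() - 1);
}

int main()
{
    const int kIterations = 100;  // matches the ~100 launches described above

    int *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    std::vector<float> ms(kIterations);
    for (int i = 0; i < kIterations; ++i) {
        cudaEventRecord(start);
        testKernel1<<<1, 1>>>(1, 2, d_out);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);                 // wait for this launch to finish
        cudaEventElapsedTime(&ms[i], start, stop);  // elapsed time in milliseconds
    }

    printf("first launch: %f ms, startup overhead: %f ms\n",
           ms[0], startupOverheadMs(ms));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return cudaGetLastError() == cudaSuccess ? 0 : 1;
}
```

Because cudaMalloc runs before the timed loop, context creation should already have happened by the first launch, so the first sample mostly reflects per-kernel setup rather than general initialization.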

When I go beyond about 1000 of these function invocations, the startup overhead seems to shrink.

Can anyone explain this behavior? Is there a better way to conduct this experiment?

I’ve never measured this, but I don’t think code size should have a significant effect on the kernel launch time.

Obviously the kernel code does have to be downloaded to GPU memory (at the first invocation), but this should be insignificant compared to other initialization steps (textures etc.).
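One way to check this, as a rough sketch only: time context initialization separately from the first and second launches using a host clock. The names emptyKernel and msSince are made up for illustration, and cudaFree(0) is used here as a commonly seen idiom to force context creation, not as anything the original experiment did.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: replace with the kernel under test.
__global__ void emptyKernel() {}

// Milliseconds elapsed on the host clock since t0.
static double msSince(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    // Step 1: general initialization (context creation), no kernel involved.
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);
    double initMs = msSince(t0);

    // Step 2: first launch, which includes any one-time setup for the kernel.
    t0 = std::chrono::steady_clock::now();
    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    double firstMs = msSince(t0);

    // Step 3: second launch, for comparison with the first.
    t0 = std::chrono::steady_clock::now();
    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    double secondMs = msSince(t0);

    printf("init: %.3f ms, first launch: %.3f ms, second launch: %.3f ms\n",
           initMs, firstMs, secondMs);
    return cudaGetLastError() == cudaSuccess ? 0 : 1;
}
```

If the init figure dominates the other two, that would be consistent with code size mattering little, but the numbers will depend on the driver, the GPU, and the toolkit version.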