CUDA overhead

I have a question about the overhead of calling a kernel (a function that runs on the GPU).
What does this overhead consist of? Does it include copying the input parameters to registers? What else does it include?
And what is startup overhead? What does it consist of?
For example, the scanLargeArray sample in CUDA SDK 2.1 contains these lines:


// run once to remove startup overhead
prescanArray(d_odata, d_idata, num_elements);

// Run the prescan

prescanArray(d_odata, d_idata, num_elements);


I don't understand why making the first kernel call removes the startup overhead from the second kernel call.


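It's the CUDA initialization, which happens at the first call of a CUDA kernel or a cudaMemcpy and takes some time. The first call pays that one-time cost, so the second (timed) call no longer includes it. What exactly happens during that initialization, I can only guess.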
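
If you want to see the effect yourself, here is a minimal sketch that times two identical launches with a host clock. It is not taken from the SDK sample: the kernel `touch` and the helper `timedLaunch` are made-up names for illustration, and it assumes a toolkit recent enough to provide `cudaDeviceSynchronize` (older releases such as the one shipped with SDK 2.1 used `cudaThreadSynchronize` instead). Note that the `cudaMalloc` before the first launch may already absorb part of the one-time initialization, so how large the difference is will depend on your setup.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel (not from the SDK sample):
// each thread writes its own index into the output array.
__global__ void touch(int *out)
{
    out[threadIdx.x] = threadIdx.x;
}

// Launch the kernel, wait for it to finish, and return the
// elapsed host wall-clock time in milliseconds.
static double timedLaunch(int *d_out)
{
    auto t0 = std::chrono::steady_clock::now();
    touch<<<1, 32>>>(d_out);
    // Kernel launches are asynchronous, so wait before stopping the clock.
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    int *d_out = nullptr;
    if (cudaMalloc((void **)&d_out, 32 * sizeof(int)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    double first  = timedLaunch(d_out);  // includes any one-time startup cost
    double second = timedLaunch(d_out);  // same launch, after the warm-up

    std::printf("first launch:  %.3f ms\n", first);
    std::printf("second launch: %.3f ms\n", second);

    cudaFree(d_out);
    return 0;
}
```

Compile with something like `nvcc -o warmup warmup.cu` (the file name is just an example). The first number is the one the sample's "run once" call is meant to keep out of the measurement.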