startup overhead

I have a question about the overhead of calling a kernel (a function that runs on the GPU).
What does this overhead include? Does it include copying the input parameters to registers? What else does it contain?
And what is startup overhead? What does it include?
For example, the scanLargeArray sample in CUDA SDK 2.1 contains these lines:

(The prescanArray function is what calls the kernel function.)

// run once to remove startup overhead
prescanArray(d_odata, d_idata, num_elements);

// Run the prescan
prescanArray(d_odata, d_idata, num_elements);


I don't understand why calling the kernel the first time removes the startup overhead for the second call.
Does it allocate memory, etc. the first time and not do it again the second time?

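AFAIK, when you use the Runtime API, the first kernel call sets up the grid and possibly does some more one-time work such as context initialization. That work is not repeated on later calls, which is why the second call no longer pays for it. In the Driver API, you have to do this setup manually beforehand.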
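
One way to see this is to time two launches of the same kernel from the host. The sketch below is not taken from the SDK sample: emptyKernel and timedLaunchMs are made-up names, and it assumes a recent CUDA toolkit (it uses cudaDeviceSynchronize, whereas SDK 2.1 code would have used cudaThreadSynchronize). It uses a host clock rather than CUDA events, because creating an event would itself trigger the runtime's initialization before the first launch.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel: it does no work, so the measured time
// is almost entirely launch and initialization overhead.
__global__ void emptyKernel() {}

// Launch the kernel once and return the elapsed wall-clock time in
// milliseconds, including the wait for the GPU to finish.
static double timedLaunchMs() {
    auto start = std::chrono::steady_clock::now();
    emptyKernel<<<1, 1>>>();
    // Kernel launches are asynchronous, so wait before stopping the clock.
    cudaError_t err = cudaDeviceSynchronize();
    auto stop = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    // First launch: also pays for any one-time runtime setup.
    double firstMs = timedLaunchMs();
    // Second launch: the setup is already done.
    double secondMs = timedLaunchMs();
    printf("first launch:  %.3f ms\n", firstMs);
    printf("second launch: %.3f ms\n", secondMs);
    return 0;
}
```

I would expect the first number to be noticeably larger than the second, but the actual values depend on the driver, the GPU, and the toolkit version, so treat this only as a way to observe the effect on your own machine.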