fundamental CUDA kernel launch questions

I have a couple of questions regarding CUDA kernel launches:

(1) What factors determine the launch overhead? I have heard it quoted as 10 µs, but in one application I measured it varying from 20 µs to 27 µs. Why is it variable, and which factors (number of threads, number of blocks, grid dimensions, size of the kernel code, etc.) determine the launch overhead, and how much does each contribute?

(2) If a kernel is called repeatedly inside a loop and is used mainly as a means of synchronization (a barrier between iterations of the computation), is it possible to bring the launch overhead down? I find that for some applications the launch overhead can be a big problem, especially when the kernel is invoked zillions of times and each invocation takes only hundreds of microseconds to execute. In that case the launch overhead (10-30 µs per call) could presumably be reduced significantly if we know the kernel is being launched mainly to provide barrier synchronization.
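For concreteness, here is a minimal sketch of the pattern I mean (the kernel name and its body are just placeholders, not my real code):

```
#include <cuda_runtime.h>

// Placeholder per-iteration kernel; the real work doesn't matter here.
__global__ void step(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

void run_iterations(float *d_data, int n, int iterations)
{
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    for (int i = 0; i < iterations; ++i) {
        // Every one of these launches pays the 10-30 µs overhead, even
        // though it is needed mostly for the implicit barrier: no block of
        // iteration i+1 starts before all blocks of iteration i finish.
        step<<<grid, block>>>(d_data, n);
    }
    cudaThreadSynchronize();   // CUDA 2.x-era call; cudaDeviceSynchronize() in newer toolkits
}
```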

My very limited experimentation showed a linear relationship between the number of blocks launched and the launch overhead. I timed ~1.0 ms per launch for 60,000 blocks on an 8800 GTX (launching an empty kernel). There is probably also a weak dependence on the number of bytes of arguments in the kernel parameter list.
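For reference, a rough reconstruction of that kind of measurement (not my exact code); varying grid.x is how the linear trend shows up:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}   // kernel body does nothing

int main()
{
    const int launches = 100;
    dim3 grid(60000);               // 60,000 blocks, as in the test above

    empty_kernel<<<grid, 1>>>();    // warm-up to exclude one-time init cost
    cudaThreadSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < launches; ++i)
        empty_kernel<<<grid, 1>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg time per empty launch: %.3f ms\n", ms / launches);
    return 0;
}
```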

I don’t know for certain, but at least with the runtime API I don’t think there is a way. Perhaps a driver API expert could comment on whether it is possible to set up the launch once and then launch the kernel several times (and whether that has any effect on the launch overhead).
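In case it helps the discussion, this is roughly what "set up once, launch many times" would look like with the driver API of that era (the cuParamSet*/cuLaunchGrid calls, before cuLaunchKernel existed). The cubin file and function name are made up, and whether this actually shaves off per-launch overhead is exactly the open question:

```
#include <cuda.h>

void launch_many(CUdeviceptr d_data, unsigned int n, int iterations)
{
    CUmodule   mod;
    CUfunction func;
    cuModuleLoad(&mod, "kernel.cubin");        // hypothetical cubin file
    cuModuleGetFunction(&func, mod, "step");   // hypothetical entry point

    // Launch configuration and arguments are set exactly once...
    cuFuncSetBlockShape(func, 256, 1, 1);
    cuParamSetv(func, 0, &d_data, sizeof(d_data));
    cuParamSeti(func, sizeof(d_data), n);
    cuParamSetSize(func, sizeof(d_data) + sizeof(n));

    // ...and only the launch itself is repeated.
    int gridW = (n + 255) / 256;
    for (int i = 0; i < iterations; ++i)
        cuLaunchGrid(func, gridW, 1);
    cuCtxSynchronize();
}
```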

One technique I played around with at one point was to put kernel arguments that don’t change from call to call (such as the data pointers) into constant memory once for the whole series of launches. I noticed a slight performance improvement, but I didn’t document it.
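For anyone who wants to try it, the idea was roughly this (illustrative names, not my original code): copy the unchanging pointers into __constant__ memory once, so that each subsequent launch passes no parameters at all.

```
#include <cuda_runtime.h>

__constant__ float *c_in;   // set once for the whole series of launches
__constant__ float *c_out;
__constant__ int    c_n;

__global__ void step()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < c_n)
        c_out[i] = c_in[i] * 2.0f;   // placeholder work
}

void run(float *d_in, float *d_out, int n, int iterations)
{
    // One-time setup replaces per-launch argument marshalling.
    cudaMemcpyToSymbol(c_in,  &d_in,  sizeof(d_in));
    cudaMemcpyToSymbol(c_out, &d_out, sizeof(d_out));
    cudaMemcpyToSymbol(c_n,   &n,     sizeof(n));

    dim3 block(256), grid((n + 255) / 256);
    for (int i = 0; i < iterations; ++i)
        step<<<grid, block>>>();     // empty parameter list on every launch
    cudaThreadSynchronize();
}
```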

The CUDA 2.0 driver API seems to have less overhead, but I have not done a direct comparison.