I have a couple of questions regarding CUDA kernel launches:
(1) What factors determine the launch overhead? I heard from someone that it is 10us, but for some applications I saw it vary from 20us to 27us. Why does it vary, which factors (number of threads, blocks, grid, size of code, etc.) determine the launch overhead, and how much does each contribute?
(2) If a kernel is called repeatedly within a loop and used mainly as a means of synchronization (a barrier) between iterations, is it possible to bring down the kernel launch overhead? I find that for some applications the launch overhead can be a big problem, especially if the kernel is invoked zillions of times and each invocation takes only hundreds of microseconds to execute. In that case, it seems the launch overhead (10us-30us …) could be reduced significantly, given that we know the kernel is launched mainly to enable barrier synchronization.
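
To make both questions concrete, here is a minimal sketch of how the per-launch cost could be measured. It is not the code from my application: `emptyKernel`, `avgMicrosPerLaunch`, and the block/thread counts are hypothetical names and values chosen for illustration. It times an empty kernel launched in a loop, once with a host-side `cudaDeviceSynchronize()` after every launch and once with a single synchronization at the end. Launches into the same stream execute in order, so the kernel boundary still acts as a barrier between iterations in the second variant. Note that the measured time is host wall-clock time per iteration, which includes more than the launch overhead alone.

```cuda
#include <chrono>
#include <cstdio>
#include <initializer_list>
#include <cuda_runtime.h>

// Empty kernel: the measured time is dominated by launch/sync cost, not by work.
__global__ void emptyKernel() {}

// Returns the average host wall-clock time per loop iteration, in microseconds.
// syncEachIter == true : the host waits for the GPU after every launch.
// syncEachIter == false: launches are queued back to back on the default stream
//                        (still executed in order) and the host waits once at the end.
double avgMicrosPerLaunch(int blocks, int threads, int iters, bool syncEachIter) {
    // Warm-up launch so one-time initialization is not included in the timing.
    emptyKernel<<<blocks, threads>>>();
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        emptyKernel<<<blocks, threads>>>();
        if (syncEachIter) {
            cudaDeviceSynchronize();
        }
    }
    cudaDeviceSynchronize();  // make sure all queued launches have finished
    auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}

int main() {
    const int iters = 10000;  // arbitrary iteration count for averaging
    for (int blocks : {1, 64, 1024}) {
        for (int threads : {32, 256, 1024}) {
            double withSync = avgMicrosPerLaunch(blocks, threads, iters, true);
            double noSync   = avgMicrosPerLaunch(blocks, threads, iters, false);
            std::printf("blocks=%4d threads=%4d  sync each iter: %.2f us  "
                        "sync once: %.2f us\n",
                        blocks, threads, withSync, noSync);
        }
    }
    return 0;
}
```

Compiling with `nvcc` and varying the grid and block sizes should show how much the launch configuration matters on a given system. The absolute numbers will depend on the GPU, driver, and operating system.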