Is kernel launch time expensive?

Hi, I want to run my entire application on the GPU to avoid memory transfers between the host and the device. But my application requires a lot of sequential processing, so I have to divide it into many kernels. So now my application has many kernel launches but no memory transfers, as the results stay on the GPU. Will many kernel launches hurt performance?

Any thoughts?

I am not an expert, but I think kernel launches have negligible hardware and software overhead.
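One way to check this on your own card is to time a batch of launches of an empty kernel, so that whatever you measure is launch overhead rather than useful work. The following is only a rough sketch: `empty_kernel` and `avg_launch_us` are names made up for this example, error checking is left out, and the numbers will depend on the GPU, driver, and OS.

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Empty kernel: any time measured is launch overhead, not useful work.
__global__ void empty_kernel() {}

// Average wall-clock microseconds per launch over `launches` launches.
// If sync_each is true, the host waits for the GPU after every launch;
// otherwise the launches are simply queued one after another.
double avg_launch_us(int launches, bool sync_each) {
    empty_kernel<<<1, 1>>>();      // warm-up: the first launch pays one-time setup costs
    cudaDeviceSynchronize();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i) {
        empty_kernel<<<1, 1>>>();
        if (sync_each) cudaDeviceSynchronize();
    }
    cudaDeviceSynchronize();       // wait until everything queued has finished
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / launches;
}

int main() {
    const int launches = 10000;    // arbitrary sample size for this sketch
    printf("queued launches:         %.2f us each\n", avg_launch_us(launches, false));
    printf("sync after every launch: %.2f us each\n", avg_launch_us(launches, true));
    return 0;
}
```

If the per-launch time is small compared with how long your real kernels run, the extra launches should not matter much.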

If your kernels contain a reasonable amount of work (at least several times more blocks than SMs on the card) and you’re just queuing all of these launches back to back, you shouldn’t see any noticeable impact.
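To see how your grid size compares with the number of SMs, you can query the device properties. This is a minimal sketch, assuming device 0; the problem size `n`, the block size of 256 threads, and the helper `blocks_for` are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Number of blocks needed to cover n elements, rounding up.
int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("could not query CUDA device 0\n");
        return 1;
    }

    const int n = 1 << 20;     // hypothetical problem size
    const int threads = 256;   // hypothetical block size
    const int blocks = blocks_for(n, threads);

    // Compare the grid size with the number of SMs on this card.
    printf("%d blocks on %d SMs (%.1f blocks per SM)\n",
           blocks, prop.multiProcessorCount,
           (double)blocks / prop.multiProcessorCount);
    return 0;
}
```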