Launch kernel in advance?

Hello all!

Sometimes I have to wait for a kernel to finish, so I call cudaDeviceSynchronize() or cudaStreamSynchronize(). At that point the host blocks until the kernel finishes and cannot prepare the next kernel launch. As a result, once the synchronization returns, the host spends an additional 10 to 15 microseconds launching the next kernel, and on fast GPUs this can be a very large relative overhead.

Is there a way to begin preparing the next kernel launch in advance, in order to avoid this overhead? Many thanks!

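Launch the next kernel before calling cudaDeviceSynchronize(). Kernel launches are asynchronous, and kernels issued to the same stream execute in order, so the second kernel is already queued and can start as soon as the first one finishes.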
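
If the host still needs to know when the first kernel is done, one option is to record an event after it, enqueue the next kernel, and then wait on the event instead of the whole device. Here is a minimal sketch of that idea. The kernels stage_a and stage_b and the helper run_queued are made-up names for illustration, and this only applies when the second launch does not depend on a host-side decision based on the first kernel's results.

```cuda
#include <cuda_runtime.h>

// Sketch-level error handling: return -1 on any CUDA error
// (resources are not released on that path).
#define CUDA_TRY(call) do { if ((call) != cudaSuccess) return -1.0f; } while (0)

// Hypothetical first kernel: a[i] = 2 * i
__global__ void stage_a(float* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = 2.0f * i;
}

// Hypothetical second kernel, consumes stage_a's output: b[i] = a[i] + 1
__global__ void stage_b(const float* a, float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = a[i] + 1.0f;
}

// Enqueues both kernels back to back, waits only for stage_a via an event,
// then waits for the rest of the stream. Returns b[n - 1].
float run_queued(int n) {
    float *a = nullptr, *b = nullptr;
    cudaStream_t stream;
    cudaEvent_t a_done;
    CUDA_TRY(cudaMalloc(&a, n * sizeof(float)));
    CUDA_TRY(cudaMalloc(&b, n * sizeof(float)));
    CUDA_TRY(cudaStreamCreate(&stream));
    CUDA_TRY(cudaEventCreateWithFlags(&a_done, cudaEventDisableTiming));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    stage_a<<<blocks, threads, 0, stream>>>(a, n);
    CUDA_TRY(cudaGetLastError());
    CUDA_TRY(cudaEventRecord(a_done, stream));   // marks "stage_a finished"

    // Enqueued before any synchronization, so the launch cost overlaps
    // with stage_a still running on the GPU.
    stage_b<<<blocks, threads, 0, stream>>>(a, b, n);
    CUDA_TRY(cudaGetLastError());

    // Host blocks only until stage_a is done; stage_b is already queued.
    CUDA_TRY(cudaEventSynchronize(a_done));
    // ... host-side work that depends on stage_a would go here ...

    CUDA_TRY(cudaStreamSynchronize(stream));     // now wait for stage_b too
    float last = 0.0f;
    CUDA_TRY(cudaMemcpy(&last, b + (n - 1), sizeof(float), cudaMemcpyDeviceToHost));

    CUDA_TRY(cudaEventDestroy(a_done));
    CUDA_TRY(cudaStreamDestroy(stream));
    CUDA_TRY(cudaFree(a));
    CUDA_TRY(cudaFree(b));
    return last;
}
```

Calling run_queued(1024) from main() exercises the whole sequence. The design point is simply that the second launch is issued while the first kernel is still running, so its launch cost is hidden instead of being paid after the synchronization returns.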

CUDA graphs may also help reduce launch latency.
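
A rough sketch of what that could look like with stream capture: record a short sequence of launches once, instantiate it, and then replay it with a single cudaGraphLaunch() per iteration. The kernel add_one and the helper run_graph are made-up names for illustration. I'm assuming CUDA 11.4 or newer here so that cudaGraphInstantiateWithFlags() is available. Whether this actually reduces your overhead depends on the workload, so measure it.

```cuda
#include <cuda_runtime.h>

// Sketch-level error handling: return -1 on any CUDA error
// (resources are not released on that path).
#define CUDA_TRY(call) do { if ((call) != cudaSuccess) return -1.0f; } while (0)

// Hypothetical kernel: adds 1 to every element.
__global__ void add_one(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Captures two add_one launches into a graph, replays the graph `iters`
// times, and returns x[0] (which starts at 0).
float run_graph(int n, int iters) {
    float* x = nullptr;
    cudaStream_t stream;
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    CUDA_TRY(cudaMalloc(&x, n * sizeof(float)));
    CUDA_TRY(cudaStreamCreate(&stream));
    CUDA_TRY(cudaMemsetAsync(x, 0, n * sizeof(float), stream));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Record the launch sequence once; nothing executes during capture.
    CUDA_TRY(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    add_one<<<blocks, threads, 0, stream>>>(x, n);
    add_one<<<blocks, threads, 0, stream>>>(x, n);
    CUDA_TRY(cudaStreamEndCapture(stream, &graph));
    CUDA_TRY(cudaGraphInstantiateWithFlags(&exec, graph, 0));

    // Each replay submits both kernels with one host-side call.
    for (int it = 0; it < iters; ++it) {
        CUDA_TRY(cudaGraphLaunch(exec, stream));
    }
    CUDA_TRY(cudaStreamSynchronize(stream));

    float first = 0.0f;
    CUDA_TRY(cudaMemcpy(&first, x, sizeof(float), cudaMemcpyDeviceToHost));

    CUDA_TRY(cudaGraphExecDestroy(exec));
    CUDA_TRY(cudaGraphDestroy(graph));
    CUDA_TRY(cudaStreamDestroy(stream));
    CUDA_TRY(cudaFree(x));
    return first;
}
```

Each replay runs add_one twice, so after iters replays every element has been incremented 2 * iters times. The instantiation cost is paid once up front, which is why graphs tend to pay off when the same launch sequence repeats many times.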