Sometimes I have to wait until a kernel finishes, so I call DeviceSynchronize() or StreamSynchronize(). At that point the host blocks until the kernel completes and cannot prepare the next kernel launch in the meantime. As a result, once the synchronization call returns, the host spends an additional 10 to 15 microseconds launching the next kernel, and on fast GPUs, where the kernels themselves are short, this is a very large relative overhead.
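The pattern described above might look roughly like the following minimal sketch. It assumes the CUDA runtime API (so cudaStreamSynchronize() stands in for StreamSynchronize()), and kernelA, kernelB, and run_sync_then_launch are placeholder names that do not come from the original post. It only illustrates where the launch latency shows up; it is not a fix.

```cuda
#include <cuda_runtime.h>

// Placeholder kernels (hypothetical); the real kernels are not shown in the post.
__global__ void kernelA(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

__global__ void kernelB(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: runs kernelA, waits for it, then launches kernelB.
// Returns the first element of the result so the caller can check it.
float run_sync_then_launch(int n) {
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    kernelA<<<blocks, threads, 0, stream>>>(d, n);

    // The host blocks here until kernelA has finished ...
    cudaStreamSynchronize(stream);

    // ... and only now begins launching kernelB, so the launch latency
    // is paid while the GPU sits idle instead of overlapping with kernelA.
    kernelB<<<blocks, threads, 0, stream>>>(d, n);
    cudaStreamSynchronize(stream);

    float first = 0.0f;
    cudaMemcpy(&first, d, sizeof(float), cudaMemcpyDeviceToHost);

    cudaStreamDestroy(stream);
    cudaFree(d);
    return first;
}
```

Calling run_sync_then_launch(1 << 20) from main() exercises the sequence once; the idle gap between the two kernels is the overhead in question.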
Is there a way to begin preparing the next kernel launch in advance, so that this overhead can be avoided? Many thanks!