Strange GPU idle time in profiler

Hello!
I have a question about the profiler.
My code looks like this:

memcpyAsync(…, host2device);
kernel1<<<>>>();
memcpyAsync(…, device2device);
kernel2<<<>>>();

So I suppose this code should run strictly sequentially (there are no CPU instructions between the CUDA calls). But in the profiling output (attached) there is a gap between the kernel1 invocation (blue strip) and the following memcpy (red strip), and it repeats after every kernel invocation.
What could be the reason for these GPU idle times?
profiler.png

Still interested in an answer.

If you are using the WDDM driver on Windows, it could be due to batching: the driver queues commands and submits them to the GPU in batches. Maybe a cudaStreamQuery(0) or a cudaStreamSynchronize(0) after the kernel launch fixes things.
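A minimal sketch of that suggestion, assuming the default stream is used throughout. The kernel bodies, buffer names, sizes, and the run_pipeline helper are hypothetical stand-ins, since the original post elides the arguments. cudaStreamQuery(0) does not block; it is called here only in the hope that it makes the driver submit the queued work right away.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for kernel1 and kernel2 from the question.
__global__ void kernel1(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

__global__ void kernel2(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Runs copy -> kernel1 -> copy -> kernel2 on the default stream.
// Returns the first output element, or -1.0f if a CUDA call failed.
float run_pipeline(bool flush) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h = nullptr, *dA = nullptr, *dB = nullptr;

    // Pinned host memory, so the host-to-device copy can be truly asynchronous.
    if (cudaMallocHost(&h, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dA, bytes) != cudaSuccess ||
        cudaMalloc(&dB, bytes) != cudaSuccess) {
        cudaFree(dA); cudaFree(dB); cudaFreeHost(h);
        return -1.0f;
    }
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaMemcpyAsync(dA, h, bytes, cudaMemcpyHostToDevice, 0);
    kernel1<<<blocks, threads>>>(dA, n);
    if (flush) cudaStreamQuery(0);  // non-blocking; result ignored on purpose

    cudaMemcpyAsync(dB, dA, bytes, cudaMemcpyDeviceToDevice, 0);
    kernel2<<<blocks, threads>>>(dB, n);
    if (flush) cudaStreamQuery(0);

    // Blocking copy back; it also waits for all earlier work on the stream.
    cudaError_t err = cudaMemcpy(h, dB, bytes, cudaMemcpyDeviceToHost);
    float result = (err == cudaSuccess) ? h[0] : -1.0f;

    cudaFree(dA); cudaFree(dB); cudaFreeHost(h);
    return result;
}

int main() {
    printf("first element with flush: %f\n", run_pipeline(true));
    return 0;
}
```

Profiling the program once with flush set to true and once with false should show whether batching is what produces the gaps on a given system.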

Do you have profiler counters enabled? It takes the profiler time to read the counters back to the host and tally up the stats. I find that if you want the most accurate idle timings, you need to turn off all counters. Even then, the recorded gaps will be larger than the gaps in a run with the profiler disabled.
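One way to cross-check the gap in a run without the profiler is to time it with CUDA events. The sketch below is only an illustration: the kernel, the buffer names, and the kernel_to_copy_ms helper are hypothetical. It measures the time from the end of the kernel to the end of the following device-to-device copy, so the value is the copy duration plus any idle time before it. The event records go through the same driver queue as the other commands, so batching may affect this measurement too.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for kernel1 from the question.
__global__ void kernel1(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Returns milliseconds from the end of kernel1 to the end of the following
// device-to-device copy, or -1.0f if a CUDA call failed.
float kernel_to_copy_ms() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dB, bytes) != cudaSuccess) { cudaFree(dA); return -1.0f; }
    cudaMemset(dA, 0, bytes);

    cudaEvent_t afterKernel, afterCopy;
    cudaEventCreate(&afterKernel);
    cudaEventCreate(&afterCopy);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    kernel1<<<blocks, threads>>>(dA, n);
    cudaEventRecord(afterKernel, 0);  // timestamp once kernel1 has finished
    cudaMemcpyAsync(dB, dA, bytes, cudaMemcpyDeviceToDevice, 0);
    cudaEventRecord(afterCopy, 0);    // timestamp once the copy has finished

    float ms = -1.0f;
    if (cudaEventSynchronize(afterCopy) == cudaSuccess) {
        cudaEventElapsedTime(&ms, afterKernel, afterCopy);
    }

    cudaEventDestroy(afterKernel);
    cudaEventDestroy(afterCopy);
    cudaFree(dA);
    cudaFree(dB);
    return ms;
}

int main() {
    printf("kernel end to copy end: %f ms\n", kernel_to_copy_ms());
    return 0;
}
```

If the reported time is far above what a copy of that size should take, the difference is a rough estimate of the idle gap.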