cudaMemcpy2DAsync long latency


I wrote a program that launches a kernel many times in a for loop. In every iteration I copy the necessary data from host to device. Everything is issued in the same stream (not the default one).
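For illustration, here is a minimal sketch of the pattern described above. It is not the actual program: the kernel `scaleKernel`, the helper `runLoop`, the 512x512 size, and the use of plain `malloc` for the host buffer are all placeholder assumptions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical): doubles every element of a pitched 2D array.
__global__ void scaleKernel(float *data, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float *row = (float *)((char *)data + y * pitch);
        row[x] *= 2.0f;
    }
}

// Issues `iterations` rounds of "copy host -> device, then launch kernel",
// all in one non-default stream. Returns the first error encountered.
cudaError_t runLoop(int iterations)
{
    const int width = 512, height = 512;             // hypothetical sizes
    const size_t hostPitch = width * sizeof(float);  // bytes per host row

    cudaStream_t stream = nullptr;
    float *devData = nullptr;
    size_t devPitch = 0;

    // Assumption: pageable host memory. The post does not say whether
    // pinned memory (cudaMallocHost) was used in the real program.
    float *hostData = (float *)malloc(hostPitch * height);
    if (!hostData) return cudaErrorMemoryAllocation;
    for (int i = 0; i < width * height; ++i) hostData[i] = 1.0f;

    cudaError_t err = cudaStreamCreate(&stream);
    if (err == cudaSuccess)
        err = cudaMallocPitch((void **)&devData, &devPitch, hostPitch, height);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);

    for (int i = 0; i < iterations && err == cudaSuccess; ++i) {
        // Host-to-device copy, issued into the same stream as the kernel.
        err = cudaMemcpy2DAsync(devData, devPitch, hostData, hostPitch,
                                hostPitch, height,
                                cudaMemcpyHostToDevice, stream);
        if (err != cudaSuccess) break;

        scaleKernel<<<grid, block, 0, stream>>>(devData, devPitch, width, height);
        err = cudaGetLastError();  // catches launch errors only
    }

    // Wait for all queued work before releasing resources.
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);

    cudaFree(devData);
    if (stream) cudaStreamDestroy(stream);
    free(hostData);
    return err;
}

int main()
{
    cudaError_t err = runLoop(10);
    printf("result: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```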

While profiling my program I noticed that the call to cudaMemcpy2DAsync usually has a very long latency, except in the first two iterations.
Why is that? Can I reduce the latency without using multiple streams?

The profiler output is attached.

I thought my CPU might not be fast enough to issue all the calls in time, so I made the kernel run much longer. The effect was that the latencies became even bigger!