I wrote a program that launches a kernel many times in a for loop. In every iteration I copy the necessary data from host to device. Everything runs in the same stream (not the default one).
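A minimal sketch of this pattern, not the actual program, might look like the following. The kernel `scaleKernel`, the buffer sizes, the iteration count, and the use of pageable host memory (`std::vector`) are all assumptions made for illustration; only the structure (one non-default stream, one `cudaMemcpy2DAsync` plus one kernel launch per iteration) comes from the description above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error: %s (line %d)\n",        \
                         cudaGetErrorString(err_), __LINE__);         \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel: doubles every element of a pitched 2D array.
__global__ void scaleKernel(float* data, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Rows are 'pitch' bytes apart, so step through memory in bytes.
        float* row = reinterpret_cast<float*>(
            reinterpret_cast<char*>(data) + y * pitch);
        row[x] *= 2.0f;
    }
}

int main()
{
    // Assumed sizes; the original post does not state them.
    const int width = 1024, height = 1024, iterations = 10;

    // Assumption: pageable host memory. The post does not say whether
    // the real host buffer is pageable or pinned.
    std::vector<float> hostData(static_cast<size_t>(width) * height, 1.0f);

    float* devData = nullptr;
    size_t devPitch = 0;
    CHECK(cudaMallocPitch(reinterpret_cast<void**>(&devData), &devPitch,
                          width * sizeof(float), height));

    // One explicitly created stream, i.e. not the default stream.
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);

    for (int i = 0; i < iterations; ++i) {
        // Host-to-device copy issued in every iteration.
        CHECK(cudaMemcpy2DAsync(devData, devPitch,
                                hostData.data(), width * sizeof(float),
                                width * sizeof(float), height,
                                cudaMemcpyHostToDevice, stream));

        // Kernel launch in the same stream, ordered after the copy.
        scaleKernel<<<grid, block, 0, stream>>>(devData, devPitch,
                                                width, height);
        CHECK(cudaGetLastError());
    }

    // Wait for all queued copies and kernels to finish.
    CHECK(cudaStreamSynchronize(stream));

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(devData));
    return 0;
}
```

Because both operations go into the same stream, each copy runs only after the previous iteration's kernel has finished.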
While profiling the program, I noticed that the call to cudaMemcpy2DAsync normally has a very high latency, except in the first two iterations.
Why is that? Can I reduce the latency without using multiple streams?
The profiler output is attached.