I’m observing this behavior:
1. A call to cudaMemcpyAsync with direction cudaMemcpyHostToDevice on a given stream does NOT wait until kernel launches on that stream are complete (i.e. it starts copying immediately).
2. A call to cudaMemcpyAsync with direction cudaMemcpyDeviceToHost on a given stream DOES wait until all kernel launches on the stream are complete.
Point 1 surprises me. Is that intended? I also didn’t find a clear statement in the documentation describing this behavior. In my use case the problem is that the cudaMemcpyAsync call overwrites a device-side buffer that is used by previous kernel launches on the same stream.
For now the workaround is an extra cudaStreamSynchronize prior to cudaMemcpyAsync. However, I want to get rid of that call since I don’t want to block the host thread.
Here’s a small reproducer: Compiler Explorer
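For readers who cannot open the link, the pattern described above can be sketched roughly as follows. This is a simplified illustration and not the linked reproducer: the kernel `consume`, the buffer names (`h_first`, `h_second`, `d_buf`, `d_out`), the buffer size, and the use of pinned host memory via cudaMallocHost are all assumptions made for the example. It only shows where the extra cudaStreamSynchronize sits in the sequence; it does not by itself demonstrate or rule out the ordering behavior in question.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal error check for runtime API calls.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error: %s (line %d)\n",        \
                         cudaGetErrorString(err_), __LINE__);         \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel standing in for "previous kernel launches":
// it reads the device buffer that the later copy overwrites.
__global__ void consume(const int* buf, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = buf[i] + 1;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);

    // Assumption: pinned host buffers; the original post does not say
    // whether pinned or pageable memory is used.
    int *h_first, *h_second, *h_out;
    CHECK(cudaMallocHost((void**)&h_first, bytes));
    CHECK(cudaMallocHost((void**)&h_second, bytes));
    CHECK(cudaMallocHost((void**)&h_out, bytes));
    for (int i = 0; i < n; ++i) {
        h_first[i] = 1;
        h_second[i] = 2;
    }

    int *d_buf, *d_out;
    CHECK(cudaMalloc((void**)&d_buf, bytes));
    CHECK(cudaMalloc((void**)&d_out, bytes));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Upload the first input, then launch a kernel that reads d_buf.
    CHECK(cudaMemcpyAsync(d_buf, h_first, bytes, cudaMemcpyHostToDevice, stream));
    consume<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, d_out, n);
    CHECK(cudaGetLastError());

    // Current workaround: block the host until the kernel has finished.
    // This is the call I would like to remove.
    CHECK(cudaStreamSynchronize(stream));

    // Overwrite the buffer the kernel was reading, on the same stream.
    CHECK(cudaMemcpyAsync(d_buf, h_second, bytes, cudaMemcpyHostToDevice, stream));

    // Fetch the kernel's result and wait for everything to finish.
    CHECK(cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream));
    CHECK(cudaStreamSynchronize(stream));

    // The kernel should have seen the first input (1), so out[i] should be 2.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != 2) ++mismatches;
    }
    std::printf("mismatches: %d\n", mismatches);

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_buf));
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(h_first));
    CHECK(cudaFreeHost(h_second));
    CHECK(cudaFreeHost(h_out));
    return mismatches == 0 ? 0 : 1;
}
```

Removing the first cudaStreamSynchronize gives the variant I am asking about: the second host-to-device copy is then enqueued while the kernel may still be reading d_buf.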