cudaMemcpyAsync with cudaMemcpyHostToDevice does not implicitly synchronize with stream

I’m observing this behavior:

  1. A call to cudaMemcpyAsync with direction cudaMemcpyHostToDevice on a given stream does NOT wait until the kernels previously launched on that stream have completed (i.e. it starts copying immediately).

  2. A call to cudaMemcpyAsync with direction cudaMemcpyDeviceToHost on a given stream DOES wait until all kernels previously launched on that stream have completed.

Point 1 surprises me. Is that intended? I also didn’t find a clear statement in the documentation describing this behavior. In my use case the problem is that the cudaMemcpyAsync call overwrites a device-side buffer that is used by previous kernel launches on the same stream.

For now the workaround is an extra cudaStreamSynchronize prior to cudaMemcpyAsync. However, I want to get rid of that call since I don’t want to block the host thread.

Here’s a small reproducer: Compiler Explorer

Your code is not correct. It overwrites the data to be transferred with std::fill before the transfer has completed, so the wrong data gets copied.
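
For illustration, here is a minimal sketch of one way to avoid that. It is not the original reproducer: the kernel `scale`, the function `run_two_rounds`, the buffer names and the sizes are all made up for this example, and it assumes the host source buffer is pinned (allocated with cudaMallocHost). An event is recorded right after the host-to-device copy, and the host waits on that event before refilling the source buffer. The device-side ordering is left to the stream, with no cudaStreamSynchronize between the kernel and the next copy.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>

// Hypothetical helper: print the error and bail out of the calling function.
#define CUDA_OK(call)                                                  \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            std::printf("%s failed: %s\n", #call,                      \
                        cudaGetErrorString(err_));                     \
            return false;                                              \
        }                                                              \
    } while (0)

// Hypothetical kernel standing in for the work that reads the device buffer.
__global__ void scale(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2 * in[i];
}

// Runs two rounds that reuse the same host and device input buffers.
// Returns true if both rounds produced the expected results.
bool run_two_rounds() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);

    int *h_in = nullptr, *d_in = nullptr, *d_out = nullptr;
    int* h_out[2] = {nullptr, nullptr};
    CUDA_OK(cudaMallocHost(&h_in, bytes));      // pinned source buffer (assumption)
    CUDA_OK(cudaMallocHost(&h_out[0], bytes));  // one result buffer per round
    CUDA_OK(cudaMallocHost(&h_out[1], bytes));
    CUDA_OK(cudaMalloc(&d_in, bytes));
    CUDA_OK(cudaMalloc(&d_out, bytes));

    cudaStream_t stream;
    cudaEvent_t h2d_done;
    CUDA_OK(cudaStreamCreate(&stream));
    CUDA_OK(cudaEventCreateWithFlags(&h2d_done, cudaEventDisableTiming));

    for (int round = 0; round < 2; ++round) {
        // Safe to write h_in here: any earlier copy out of it has finished.
        std::fill(h_in, h_in + n, round + 1);

        // In round 1 this copy overwrites d_in, which the round-0 kernel reads.
        // Both are in the same stream, so the copy runs after that kernel.
        CUDA_OK(cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream));
        CUDA_OK(cudaEventRecord(h2d_done, stream));

        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);
        CUDA_OK(cudaGetLastError());
        CUDA_OK(cudaMemcpyAsync(h_out[round], d_out, bytes,
                                cudaMemcpyDeviceToHost, stream));

        // Wait for the host-to-device copy only, so h_in may be refilled.
        CUDA_OK(cudaEventSynchronize(h2d_done));
    }

    CUDA_OK(cudaStreamSynchronize(stream));  // needed before reading h_out

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        ok = ok && h_out[0][i] == 2 && h_out[1][i] == 4;
    }

    cudaEventDestroy(h2d_done);
    cudaStreamDestroy(stream);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out[0]);
    cudaFreeHost(h_out[1]);
    return ok;
}
```

Note that cudaEventSynchronize still blocks the host until the copy has finished, and the copy itself cannot start before the preceding kernel is done. If the host thread must not wait at all, one option is to give each in-flight transfer its own pinned staging buffer instead of reusing a single one.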

Yes, I didn’t see that. Thank you!