Hi,
Someone on GitHub told me that cudaMemcpyAsync + cudaStreamSynchronize on the default stream is equivalent to cudaMemcpy (non-async); below is the implementation of cudaMemcpy.
However, I have my doubts about it.
I found a description of the default stream at https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html#stream-sync-behavior: "The legacy default stream is an implicit stream which synchronizes with all other streams in the same CUcontext except for non-blocking streams."
So I think cudaMemcpyAsync + cudaStreamSynchronize will cost more than cudaMemcpy (non-async).
Am I right?
Looking forward to any reply.
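To make the comparison concrete, here is a minimal sketch of the two variants in question (error checking omitted; h_data, d_data, and nbytes are placeholder names of mine, not from the original claim):

```cpp
#include <cuda_runtime.h>

void copy_variants(float *d_data, const float *h_data, size_t nbytes) {
    // Variant A: plain blocking copy
    cudaMemcpy(d_data, h_data, nbytes, cudaMemcpyHostToDevice);

    // Variant B: async copy issued to the legacy default stream (stream 0),
    // followed by an explicit synchronization of that stream
    cudaMemcpyAsync(d_data, h_data, nbytes, cudaMemcpyHostToDevice, 0);
    cudaStreamSynchronize(0);
}
```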
Thank you.
What performance differences did you find when you tried these two variants? Practical experiments are a valuable check on thought experiments.
My memory is a bit hazy (and hopefully someone more knowledgeable will correct me if I am wrong here), but I am reasonably sure cudaMemcpy() includes an implicit cudaDeviceSynchronize(), which means it waits for everything. As a consequence, I would generally expect higher performance from cudaMemcpyAsync() + cudaStreamSynchronize() than from cudaMemcpy(). But in many use cases, there may not be a discernible performance difference outside of measurement noise (±2%).
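One way to run such a practical experiment is to time the two variants from the host side. The sketch below is an illustration of mine, not anyone's posted benchmark; it assumes pinned host memory (cudaMallocHost) so the async copy can actually run asynchronously, and it omits error checking for brevity:

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t nbytes = 64 << 20;  // 64 MiB, an arbitrary test size
    float *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, nbytes);      // pinned host memory
    cudaMalloc(&d, nbytes);

    // Warm-up copy so neither timed variant pays first-use overhead
    cudaMemcpy(d, h, nbytes, cudaMemcpyHostToDevice);

    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpy(d, h, nbytes, cudaMemcpyHostToDevice);
    auto t1 = std::chrono::steady_clock::now();

    cudaMemcpyAsync(d, h, nbytes, cudaMemcpyHostToDevice, 0);
    cudaStreamSynchronize(0);
    auto t2 = std::chrono::steady_clock::now();

    printf("cudaMemcpy:            %.3f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());
    printf("async + stream sync:   %.3f ms\n",
           std::chrono::duration<double, std::milli>(t2 - t1).count());

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

For a single large transfer like this, the two numbers usually land within measurement noise of each other; repeating each variant many times gives a steadier comparison.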
Hi njuffa,
Thank you for your reply. I have studied cudaMemcpy, and it seems that it does not include an implicit cudaDeviceSynchronize.
This is my test code.
You are correct, cudaMemcpy doesn't include an implicit cudaDeviceSynchronize. If you're writing straight CUDA, nothing includes an implicit cudaDeviceSynchronize that I can think of at the moment; maybe some CUDA libraries do. That's not to say no function is blocking, though. cudaMemcpy, for example, blocks the host.
cudaMemcpy includes a synchronization on the default stream. It does not include the equivalent of cudaDeviceSynchronize().
Normally, a synchronization on the default stream synchronizes all other created streams on that device. However, if you create a stream with the cudaStreamNonBlocking flag, that stream will not be synchronized by a synchronization in the default stream.
However, the non-blocking stream will be synchronized by cudaDeviceSynchronize(), which synchronizes all previously issued work to that device, regardless of stream.
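The distinction described above can be demonstrated with a small experiment of my own (a sketch, not tested against every driver version): launch a long-running kernel in a cudaStreamNonBlocking stream, then observe that synchronizing the default stream does not wait for it, while cudaDeviceSynchronize does.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// A deliberately long-running kernel (busy-waits for a given cycle count)
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    cudaStream_t nb;
    cudaStreamCreateWithFlags(&nb, cudaStreamNonBlocking);

    spin<<<1, 1, 0, nb>>>(1LL << 30);

    // Synchronizing the legacy default stream does not wait for 'nb',
    // so the kernel is most likely still running here:
    cudaStreamSynchronize(0);
    printf("after default-stream sync, kernel done? %s\n",
           cudaStreamQuery(nb) == cudaSuccess ? "yes" : "no");

    // cudaDeviceSynchronize() waits for all previously issued work,
    // regardless of stream, so the kernel is guaranteed done here:
    cudaDeviceSynchronize();
    printf("after cudaDeviceSynchronize, kernel done? %s\n",
           cudaStreamQuery(nb) == cudaSuccess ? "yes" : "no");

    cudaStreamDestroy(nb);
    return 0;
}
```

The first query is timing-dependent in principle, but with a sufficiently long spin it should report that the kernel is still in flight.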
Okay, I’ve modified the example at https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/ to create two simple scenarios that show cudaMemcpy is indeed equivalent to cudaMemcpyAsync + cudaStreamSynchronize: one with default stream creation and one with the cudaStreamNonBlocking flag. I’ve added screenshots of Nsight Systems as well.
I can’t guarantee that it is true in every scenario.