Some documents say that a memory copy within the same device is asynchronous, so I assumed the API “cuMemcpyDtoD()” is asynchronous. If so, what is the purpose of the API “cuMemcpyDtoDAsync()”? Under what circumstances should I call “cuMemcpyDtoDAsync()”?
First of all, you would use it when you are using the driver API. Second, you would use it (typically) when you want to copy data from one place in device global memory to another place in device global memory.
It is analogous to the runtime API function cudaMemcpyAsync, with the transfer kind “cudaMemcpyDeviceToDevice” specified.
cuMemcpyDtoD is a blocking API call, issued to the default stream only. Note what the documentation says:
“This function exhibits synchronous behavior for most use cases.”
cuMemcpyDtoDAsync can be a non-blocking call, issuable to any stream. Note what the documentation says:
“This function exhibits asynchronous behavior for most use cases.”