cuMemcpyDtoD - example (expected perf)

I am new to CUDA and would like to use cuMemcpyDtoD between two devices in an asynchronous manner.

Is this doable asynchronously?
Does anyone have any sample code for this?
What is the expected performance for such a copy?

cuMemcpyDtoD copies between two device memory locations on the same device.

Copying data between two different devices is a requested feature, but has not yet been implemented. For now, you have to do a cuMemcpyDtoH with one CUDA context, then cuMemcpyHtoD in another CUDA context (since each CUDA context can only talk to one device).

Thanks for the quick tip. So I would use the Async memory calls: D2H in context 1, and then H2D in context 2.

What kind of performance can I expect, if you know? I am looking for an MB/ms figure.

BTW, all the code examples in the SDK use macros such as CUDA_SAFE_CALL. I did not see them in the programmer's manual, the reference guide, or the header files. Are there other resources?

They are defined in cutil.h from the SDK.

CUDA_SAFE_CALL automatically checks the error code when compiled with _DEBUG. Without _DEBUG, it just makes the call and does not check the result.

By the way, does this mean that CUDA_SAFE_CALL should not be used for production code?

Because if it ignores errors in the final product, that’s kinda bad…

CUDA_SAFE_CALL does a sync after every call in order to check errors. Generally, you’re going to want much coarser-grained error checking in production code than that.