Host-device memory copies up to 64 KB are asynchronous

In section of the updated CUDA C programming guide it says that host <-> device memory copies of a memory block of 64 KB or less are asynchronous. Is this a change in behavior? I thought cudaMemcpy() calls were always synchronous.

It’s not a change in behavior, actually. Kind of hard to explain without presenting a full picture of how copies work, but basically we can immediately send that to the GPU without having to wait for prior work to finish such that when cudaMemcpy returns you can still free that host memory.

Are the last (size % 64 KB) of a larger copy are also sent asynchronously?

No, this is only for copies that are less than 64KB.

So is this code incorrect if bytes <= 64 KB?

[codebox]// Launch the first kernel

kenel1<<< , >>>(data_d);

// Readback the results from kernel 1

cudaMemcpy(data_h, data_d, bytes, cudaMemcpyDeviceToHost);

// Depending on the results of the first kernel, launch one of two possible versions of kernel 2

if (*data_h)

kernel2a<<< , >>>(data_d);


kernel2b<<< , >>>(data_d);[/codebox]

From my understanding of the new documentation, if bytes is <= 64 KB the cudaMemcpy() call is asynchronous and the branching test to call kernel2a or kernel2b won’t always work properly because data_h may not be copied yet.

No, it is not. Host-to-device copies of less than 64KB performed via cuMemcpyHtoD or cudaMemcpy(…, cudaMemcpyHostToDevice) are asynchronous in the sense that they may return before the data is actually available on the GPU. However, the user is still free to change the buffer as soon as those calls return with no ill effects; they’re placed in a staging buffer (for various reasons) that is then sent to the GPU.

Device-to-host copies performed via cuMemcpyDtoH or cudaMemcpy(…, cudaMemcpyDeviceToHost) are never asynchronous.

Then the new documentation seems to be wrong. It says the following are asynchronous:

Host <-> device memory copies of a memory block of 64 KB or less;

The arrows pointing in both directions implies copies in both directions, whereas you are saying:

Host -> device memory copies of a memory block of 64 KB or less;

Then yeah, it’s wrong.