Are cudaMemcpy and cudaMalloc blocking/synchronous?

Hi,

I am trying to understand and reconcile two sources of documentation regarding the synchronization behavior of a memory copy within the same device (cudaMemcpy with cudaMemcpyDeviceToDevice).

According to this paragraph in the CUDA programming guide:

  • a memory copy between two addresses to the same device memory,

it seems that such a cudaMemcpy will act as a synchronization barrier for commands in different streams, but not for work in the NULL/default stream, i.e. it doesn’t provide any guarantee of synchronicity with the host. This is the behavior described in this other post.

On the other hand, this page in the CUDA Toolkit documentation mentions:

  1. For transfers from device memory to device memory, no host-side synchronization is performed.

Is my understanding correct that cudaMemcpy (with cudaMemcpyDeviceToDevice) is non-blocking for the host thread? In contrast, is a cudaMemcpy with cudaMemcpyDeviceToHost both blocking and synchronizing w.r.t. the host thread?

What about cudaMalloc? I understand from the first link that it is blocking for any work in different streams. But is it synchronizing w.r.t. the host thread?

Thank you.

cudaMemcpy is an operation issued to the NULL stream. Assuming no changes to NULL stream behavior, it has NULL stream semantics. That means that:

  • all previously issued work (cudaMemcpyxxx, kernel launches) to that device must complete before the cudaMemcpy is allowed to begin, and
  • no subsequently issued work to that device can begin until the cudaMemcpy operation is complete.

The above statements are true regardless of the direction-of-transfer specification. They refer to the scheduling of work on the device, not directly to host activity.
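Here is a minimal, runnable sketch of those NULL-stream semantics (the spin() kernel, sizes, and cycle count are my own illustration, not from the original discussion); it assumes the legacy default stream, i.e. code not compiled with --default-stream per-thread. The serialization is visible on the device timeline in a profiler such as Nsight Systems:

```cpp
#include <cuda_runtime.h>

// Busy-wait kernel, just to occupy the device for a visible amount of time.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    char *src, *dst;
    const size_t N = 1 << 20;
    cudaMalloc(&src, N);
    cudaMalloc(&dst, N);

    spin<<<1, 1, 0, s1>>>(100000000LL);   // work issued to a created stream
    // Legacy NULL-stream semantics: this copy cannot begin until spin() in s1
    // completes, and spin() in s2 below cannot begin until the copy completes.
    cudaMemcpy(dst, src, N, cudaMemcpyDeviceToDevice);
    spin<<<1, 1, 0, s2>>>(100000000LL);

    cudaDeviceSynchronize();
    cudaFree(src); cudaFree(dst);
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    return 0;
}
```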

A cudaMemcpy operation D->D (same device) is non-blocking to the host thread. If you want to prove this to yourself, a simple test can be constructed: perform a large D->D transfer, time the CPU duration of the call, and compare the implied bandwidth to a plausible upper-bound measurement. (For example, I have done this just now on an L4 GPU. I did a 4GB D2D transfer - 4GB of read and 4GB of write - and measured a 9 microsecond call duration. That is not a plausible transfer time, therefore the call must be non-blocking to the host thread.) But refer to the above statements as to when the copy actually begins and ends.
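A minimal sketch of that timing test (the buffer size and the use of std::chrono are my own choices; shrink N if your GPU has less memory):

```cpp
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    const size_t N = 1ULL << 32;   // 4 GB; reduce if your GPU is smaller
    char *src, *dst;
    cudaMalloc(&src, N);
    cudaMalloc(&dst, N);
    cudaDeviceSynchronize();       // start from an idle device

    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpy(dst, src, N, cudaMemcpyDeviceToDevice);
    auto t1 = std::chrono::high_resolution_clock::now();  // call has returned
    cudaDeviceSynchronize();       // now wait for the copy itself to finish
    auto t2 = std::chrono::high_resolution_clock::now();

    printf("CPU duration of cudaMemcpy call: %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    printf("Actual copy duration:            %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t0).count());

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

If the first number is orders of magnitude smaller than the second, the call returned long before the copy completed, i.e. it did not block the host thread.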

cudaMemcpy with H->D or D->H direction specification is generally blocking to the host thread. Certainly in the D->H direction this is necessary for CUDA to behave correctly; it’s discoverable in the most basic introductory CUDA treatments.
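To see why, consider a sketch like the following (the fill() kernel is a hypothetical illustration): the host reads the destination buffer immediately after the call returns, which is only correct if the call blocked until the copy completed.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration.
__global__ void fill(int *out, int val) { out[threadIdx.x] = val; }

int main() {
    const int n = 32;
    int *d;
    int h[n];
    cudaMalloc(&d, n * sizeof(int));
    fill<<<1, n>>>(d, 42);   // asynchronous launch into the NULL stream
    // D->H copy: the kernel must finish first (NULL-stream ordering), and the
    // call must not return until h[] is populated, since the host reads it next.
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h[0] = %d\n", h[0]);   // safe precisely because the call blocked
    cudaFree(d);
    return 0;
}
```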

cudaMalloc historically required a cessation of device activity, notionally because it is modifying the GPU memory map. Therefore on the device side it manifests as a “gap” in device activity, not unlike an operation issued to the NULL stream. This is the “synchronization” referred to in the first link you cited. However, with recent developments such as cudaMallocAsync and default memory pools, this may have changed somewhat. I wouldn’t suggest depending on any specific behavior here, but you should acknowledge that it could be synchronizing, and that might be “unwanted”.

With respect to the host thread, I believe cudaMalloc is blocking. But since no particular behavior is specified, you should assume the behavior could be either blocking or non-blocking with respect to the host thread, and make your code correct in spite of that. (I’m honestly not aware of how you might construct your code differently in either case; alternatively: “why does it matter?”) However, my advice to CUDA programmers (primarily because of the potential for device-side synchronization, as already mentioned) is to get any non-essential operations out of carefully crafted work-issuance loops. When it is impossible to remove cudaMalloc-style operations from work-issuance loops, it may be worthwhile to investigate cudaMallocAsync along with memory pools, as in the sketch below.
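A minimal sketch of the stream-ordered alternative, assuming CUDA 11.2 or newer (where cudaMallocAsync/cudaFreeAsync and the default device memory pool are available):

```cpp
#include <cuda_runtime.h>

int main() {
    cudaStream_t s;
    cudaStreamCreate(&s);

    void *p;
    // Stream-ordered allocation from the device's default memory pool:
    // ordered within stream s rather than requiring a device-wide sync.
    cudaMallocAsync(&p, 1 << 20, s);
    // ... launch kernels that use p in stream s here ...
    cudaFreeAsync(p, s);   // stream-ordered free

    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    return 0;
}
```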

When teaching CUDA I generally try to use the word “synchronizing” to refer to effects on device activity, and “blocking” to refer to effects on host thread behavior; I hope I have not mixed anything up here. CUDA kernel launches are generally referred to as asynchronous; this has meaning for both host and device activity.
