Are cudaMemcpy and cudaMalloc blocking/synchronous?

Hi,

I am trying to understand and reconcile two sources of documentation regarding the synchronization behavior of a memory copy within the same device (cudaMemcpy with cudaMemcpyDeviceToDevice).

According to this paragraph in the CUDA programming guide:

  • a memory copy between two addresses to the same device memory,

it seems that such a cudaMemcpy will act as a synchronization barrier for commands in different streams, but not for work in the NULL/default stream, i.e. it doesn’t provide any guarantee of synchronicity with the host. This is the behavior described in this other post.

On the other hand, this page in the CUDA Toolkit documentation mentions:

  1. For transfers from device memory to device memory, no host-side synchronization is performed.

Is my understanding correct that cudaMemcpy (with cudaMemcpyDeviceToDevice) is non-blocking for the host thread? In contrast, is a cudaMemcpy with cudaMemcpyDeviceToHost both blocking and synchronizing w.r.t. the host thread?

What about cudaMalloc? I understand from the first link that it is blocking for any work in different streams. But is it synchronizing w.r.t. the host thread?

Thank you.

cudaMemcpy is an operation issued to the NULL stream. Assuming no changes to NULL stream behavior, it has NULL stream semantics. That means that:

  • all previously issued work (cudaMemcpyxxx, kernel launches) to that device must complete before the cudaMemcpy is allowed to begin, and
  • no subsequently issued work to that device can begin until the cudaMemcpy operation is complete.

The above statements are true regardless of the direction-of-transfer specification. They refer to the scheduling of work on the device, not directly to host activity.
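Here is a minimal, runnable sketch of those NULL-stream semantics (the spin() kernel, sizes, and cycle count are my own illustration, not from the original discussion); it assumes the legacy default stream, i.e. code not compiled with --default-stream per-thread. The serialization is visible on the device timeline in a profiler such as Nsight Systems:

```cpp
#include <cuda_runtime.h>

// Busy-wait kernel, just to occupy the device for a visible amount of time.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    char *src, *dst;
    const size_t N = 1 << 20;
    cudaMalloc(&src, N);
    cudaMalloc(&dst, N);

    spin<<<1, 1, 0, s1>>>(100000000LL);   // work issued to a created stream
    // Legacy NULL-stream semantics: this copy cannot begin until spin() in s1
    // completes, and spin() in s2 below cannot begin until the copy completes.
    cudaMemcpy(dst, src, N, cudaMemcpyDeviceToDevice);
    spin<<<1, 1, 0, s2>>>(100000000LL);

    cudaDeviceSynchronize();
    cudaFree(src); cudaFree(dst);
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    return 0;
}
```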

A cudaMemcpy operation D->D (same device) is non-blocking to the host thread. If you want to prove this to yourself, a simple test can be constructed: perform a large D->D transfer, time the CPU duration of the call, and compare the implied bandwidth to a plausible upper-bound measurement. (For example, I have done this just now on an L4 GPU. I did a 4GB D2D transfer - 4GB of read and 4GB of write - and measured a 9 microsecond call duration. That is not a plausible transfer time, therefore the call must be non-blocking to the host thread.) But refer to the above statements as to when the copy actually begins and ends.
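A minimal sketch of that timing test (the buffer size and the use of std::chrono are my own choices; shrink N if your GPU has less memory):

```cpp
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    const size_t N = 1ULL << 32;   // 4 GB; reduce if your GPU is smaller
    char *src, *dst;
    cudaMalloc(&src, N);
    cudaMalloc(&dst, N);
    cudaDeviceSynchronize();       // start from an idle device

    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpy(dst, src, N, cudaMemcpyDeviceToDevice);
    auto t1 = std::chrono::high_resolution_clock::now();  // call has returned
    cudaDeviceSynchronize();       // now wait for the copy itself to finish
    auto t2 = std::chrono::high_resolution_clock::now();

    printf("CPU duration of cudaMemcpy call: %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    printf("Actual copy duration:            %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t0).count());

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

If the first number is orders of magnitude smaller than the second, the call returned long before the copy completed, i.e. it did not block the host thread.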

cudaMemcpy with H->D or D->H direction specification is generally blocking to the host thread. Certainly in the D->H direction this is necessary for CUDA to behave correctly; it’s discoverable in the most basic introductory CUDA treatments.
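To see why, consider a sketch like the following (the fill() kernel is a hypothetical illustration): the host reads the destination buffer immediately after the call returns, which is only correct if the call blocked until the copy completed.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration.
__global__ void fill(int *out, int val) { out[threadIdx.x] = val; }

int main() {
    const int n = 32;
    int *d;
    int h[n];
    cudaMalloc(&d, n * sizeof(int));
    fill<<<1, n>>>(d, 42);   // asynchronous launch into the NULL stream
    // D->H copy: the kernel must finish first (NULL-stream ordering), and the
    // call must not return until h[] is populated, since the host reads it next.
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h[0] = %d\n", h[0]);   // safe precisely because the call blocked
    cudaFree(d);
    return 0;
}
```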

cudaMalloc historically required a cessation of device activity, notionally because it is modifying the GPU memory map. Therefore on the device side it manifests as a “gap” in device activity, not unlike an operation issued to the NULL stream. This is the “synchronization” referred to in the first link you cited. However, with recent developments such as cudaMallocAsync and default memory pools, this may have changed somewhat. I wouldn’t suggest depending on any specific behavior here, but you should acknowledge that it could be synchronizing, and that might be “unwanted”.

With respect to the host thread, I believe cudaMalloc is blocking. But since no particular behavior is specified, you should assume the behavior could be either blocking or non-blocking with respect to the host thread, and make your code correct in spite of that. (I’m honestly not aware of how you might construct your code differently in either case; alternatively: “why does it matter?”) However, my advice to CUDA programmers (primarily because of the potential for device-side synchronization, as already mentioned) is to get any non-essential operations out of carefully crafted work-issuance loops. When it is impossible to remove cudaMalloc-style operations from work-issuance loops, it may be worthwhile to investigate cudaMallocAsync along with memory pools, as in the sketch below.
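A minimal sketch of the stream-ordered alternative, assuming CUDA 11.2 or newer (where cudaMallocAsync/cudaFreeAsync and the default device memory pool are available):

```cpp
#include <cuda_runtime.h>

int main() {
    cudaStream_t s;
    cudaStreamCreate(&s);

    void *p;
    // Stream-ordered allocation from the device's default memory pool:
    // ordered within stream s rather than requiring a device-wide sync.
    cudaMallocAsync(&p, 1 << 20, s);
    // ... launch kernels that use p in stream s here ...
    cudaFreeAsync(p, s);   // stream-ordered free

    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    return 0;
}
```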

When teaching CUDA I generally try to use the word “synchronizing” to refer to effects on device activity, and “blocking” to refer to effects on host thread behavior; I hope I have not mixed anything up here. CUDA kernel launches are generally referred to as asynchronous; this has meaning for both host and device activity.
