cudaMemcpy is an operation issued to the NULL stream. Assuming no changes to NULL stream behavior, it has NULL stream semantics. That means that:
- all previously issued work (cudaMemcpyxxx, kernel launches) to that device must complete before the cudaMemcpy is allowed to begin, and
- no subsequently issued work to that device can begin until the cudaMemcpy operation is complete.
The above statements are true regardless of the direction-of-transfer specification. They refer to scheduling of work on the device, not directly to host activity.
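To illustrate those ordering rules, here is a minimal sketch (kernel name, buffer sizes, and launch configuration are mine; error checking omitted). The D->D copy issued to the NULL stream cannot begin until the first kernel finishes, and the second kernel cannot begin until the copy completes:

```
#include <cuda_runtime.h>

__global__ void slow_kernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) for (int k = 0; k < 10000; ++k) d[i] += 1.0f;
}

int main() {
    const int N = 1 << 20;
    float *d_a, *d_b;
    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, N * sizeof(float));

    slow_kernel<<<(N + 255) / 256, 256>>>(d_a, N);   // issued first
    // NULL-stream copy: waits on the device for slow_kernel to finish
    cudaMemcpy(d_b, d_a, N * sizeof(float), cudaMemcpyDeviceToDevice);
    // cannot begin executing on the device until the copy is complete
    slow_kernel<<<(N + 255) / 256, 256>>>(d_b, N);

    cudaDeviceSynchronize();
    cudaFree(d_a); cudaFree(d_b);
    return 0;
}
```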
A cudaMemcpy operation D->D (same device) is non-blocking to the host thread. If you want to prove this to yourself, a simple test can be constructed: perform a large D->D transfer, time the CPU duration of the call, and compare the implied bandwidth to a plausible upper-bound measurement. (For example, I have done this just now on an L4 GPU. I did a 4GB D2D transfer - 4GB of read and 4GB of write - and measured a 9 microsecond call duration. That is not a plausible transfer time, therefore the call must be non-blocking to the host thread.) But refer to the above statements as to when the transfer actually begins/ends.
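A sketch of that timing test (buffer size matches the 4GB example above - reduce it if your GPU has less memory; error checking omitted):

```
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    const size_t SZ = 1ull << 32;  // 4GB, as in the test described above
    char *d_src, *d_dst;
    cudaMalloc(&d_src, SZ);
    cudaMalloc(&d_dst, SZ);
    cudaDeviceSynchronize();  // make sure nothing else is pending

    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpy(d_dst, d_src, SZ, cudaMemcpyDeviceToDevice);
    auto t1 = std::chrono::high_resolution_clock::now();
    double call_us = std::chrono::duration<double, std::micro>(t1 - t0).count();

    cudaDeviceSynchronize();  // now wait for the copy to actually finish
    auto t2 = std::chrono::high_resolution_clock::now();
    double total_us = std::chrono::duration<double, std::micro>(t2 - t0).count();

    // If call_us is far too small to have moved 4GB of read + 4GB of write,
    // the call returned before the transfer finished, i.e. it is
    // non-blocking to the host thread.
    printf("host call: %.1f us, transfer complete: %.1f us\n", call_us, total_us);
    cudaFree(d_src); cudaFree(d_dst);
    return 0;
}
```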
cudaMemcpy with an H->D or D->H direction specification is generally blocking to the host thread. Certainly in the D->H direction this is necessary for CUDA to behave correctly; it’s discoverable in the most basic introductory CUDA treatments.
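The D->H blocking behavior is what makes the usual introductory pattern correct (kernel name and sizes are mine). No explicit synchronization is needed before reading the host buffer, because the copy both waits for the prior NULL-stream-ordered kernel and does not return until the data has landed:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(int *d, int n, int v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = v;
}

int main() {
    const int N = 1024;
    int *d_a, h_a[N];
    cudaMalloc(&d_a, N * sizeof(int));
    fill<<<(N + 255) / 256, 256>>>(d_a, N, 42);
    // Waits for fill() (NULL-stream ordering) and blocks the host thread
    // until h_a is fully populated.
    cudaMemcpy(h_a, d_a, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_a[0] = %d\n", h_a[0]);  // safe: prints 42
    cudaFree(d_a);
    return 0;
}
```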
cudaMalloc originally required a cessation of device activity, notionally because it is modifying the GPU memory map. Therefore on the device side it manifests as a “gap” in device activity, not unlike an operation issued to the NULL stream. This is the “synchronization” referred to in the first link of yours. However, with recent implementations of e.g. cudaMallocAsync and default memory pools, this may have changed somewhat. I wouldn’t suggest depending on any specific behavior here, but you should acknowledge that it could be synchronizing, and that might be “unwanted”.
With respect to the host thread, I believe cudaMalloc is blocking. Since no particular behavior is specified, you should assume that the behavior could be either blocking or non-blocking with respect to the host thread, and make your code correct in spite of that. (I’m honestly not aware of how you might construct your code differently in either case; alternatively: “why does it matter?”) However, my advice to CUDA programmers (primarily because of the potential for device-side synchronization, as already mentioned) is to get any non-essential operations out of carefully crafted work-issuance loops. When it is impossible to remove cudaMalloc-style operations from work-issuance loops, it may be worthwhile to investigate cudaMallocAsync along with memory pools; a sketch follows.
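Something along these lines (assumes CUDA 11.2+ for the stream-ordered allocator; kernel name and sizes are mine; error checking omitted). Allocations are serviced from the device’s default memory pool in stream order, avoiding the device-wide synchronization that cudaMalloc may impose:

```
#include <cuda_runtime.h>

__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = i * 2.0f;
}

int main() {
    const int N = 1 << 20;
    cudaStream_t s;
    cudaStreamCreate(&s);
    for (int iter = 0; iter < 10; ++iter) {
        float *d_tmp;
        // stream-ordered allocation: no device-wide sync, serviced
        // from the device's default memory pool
        cudaMallocAsync((void **)&d_tmp, N * sizeof(float), s);
        work<<<(N + 255) / 256, 256, 0, s>>>(d_tmp, N);
        cudaFreeAsync(d_tmp, s);  // returned to the pool, stream-ordered
    }
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    return 0;
}
```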
When teaching CUDA I generally try to use the word “synchronizing” to refer to effects on device activity, and “blocking” to refer to effects on host thread behavior; I hope I have not mixed anything up here. CUDA kernel launches are generally referred to as asynchronous; this has meaning for both host and device activity.