Confusion about the synchronous/asynchronous behavior of cudaMemcpy() and cudaMemcpyAsync()

Dear all,

I want to learn more about the details of cudaMemcpy() and cudaMemcpyAsync().
After reading the Memcpy section of the "API synchronization behavior" documentation, I have two questions.

Q1: In the Memcpy section of the documentation, it says

Synchronous

  1. For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.
  2. For transfers from pinned host memory to device memory, the function is synchronous with respect to the host.
    …

Does "the function is synchronous with respect to the host" mean the host will wait until the DMA transfer finishes? If it doesn't, can the function return to the host immediately because the host memory is already pinned?

Q2: And then in the Memcpy section of the documentation, it says

Asynchronous

  1. For transfers between device memory and pageable host memory, the function might be synchronous with respect to host.
    …

Does it mean "The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed", i.e. the same behavior as point one in the Synchronous section?

Any help will be appreciated.

Jack


Yes. Generally speaking, when you do a cudaMemcpy operation, that is blocking the CPU thread (i.e. synchronous with respect to the host) until the transfer operation has completed.

Yes, when you do a cudaMemcpyAsync, and one of the pointers you pass to it refers to pageable host memory, then the transfer may also be a blocking transfer with respect to the host thread, i.e. the same as the previous description.

With respect to this second question, you should note the use of the word “may”. That is a possibly important distinction. Even if you pass a pageable buffer, it is possible for the transfer to be asynchronous with respect to the host, or stated another way, it is possible for the host thread to be released even before the transfer is complete.
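To make the distinction concrete, here is a minimal sketch (buffer size and timing approach are illustrative, not from the original thread) that times how long the cudaMemcpyAsync call itself takes to return. With a pinned buffer the call typically returns almost immediately; with a pageable buffer it may block for much of the transfer.

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256 << 20;            // 256 MiB, large enough to measure
    float *d_buf, *h_pageable, *h_pinned;
    cudaMalloc(&d_buf, bytes);
    h_pageable = (float *)malloc(bytes);       // ordinary pageable allocation
    cudaMallocHost(&h_pinned, bytes);          // page-locked (pinned) allocation

    auto time_call = [&](float *src, const char *label) {
        auto t0 = std::chrono::steady_clock::now();
        cudaMemcpyAsync(d_buf, src, bytes, cudaMemcpyHostToDevice);
        auto t1 = std::chrono::steady_clock::now();  // time until the API call returns
        cudaDeviceSynchronize();                     // then wait for the transfer itself
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        std::printf("%s: cudaMemcpyAsync returned after %lld us\n", label, us);
    };

    time_call(h_pageable, "pageable");  // may block for most of the transfer
    time_call(h_pinned,   "pinned");    // typically returns almost immediately

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    return 0;
}
```

As noted above, the pageable case is only "may block": the exact behavior is not guaranteed, which is why the documentation hedges with "might be synchronous".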

Thank you!
I still have a little confusion about these two questions.
Q1:

Both copying from pageable host memory to the device and copying from pinned host memory to the device make the host wait, but they differ as shown below.

  • Copying from pageable host memory to device
    • The function will return once the pageable buffer has been copied to the staging memory for DMA, but the DMA to final destination may not have completed. (based on the documentation)
  • Copying from pinned host memory to device
The function will not return until DMA completes the transfer operation. (based on the documentation and the answer above)

Is my understanding correct?

Q2:

I know the asynchronous API can help us overlap GPU kernel execution with data transfers.
I am a newbie to CUDA. Should I always keep in mind which kinds of arguments will cause the host to synchronize? Could you give some tips on when I should deliberately pay attention to the synchronization behavior of an API?

Jack

Certainly if you quote the documentation, I wouldn't argue with that. I would question why this point would ever matter. I don't spend time trying to remember such things. There isn't anything you can do as a CUDA programmer that would allow you to meaningfully observe a difference here, other than measurements of elapsed time, I guess.

With respect to cudaMemcpy, it always has blocking and synchronizing behavior:

blocking: causes the host thread to wait there until the transfer is complete

synchronizing: does not begin until all previously issued work to the device is complete

cudaMemcpy, since it does not accept a stream argument, always has null stream semantics. A solid grasp of null stream semantics will help to cement the ideas I’ve stated already.
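A minimal sketch of those two properties (the kernel and sizes are made up for illustration): the cudaMemcpy below does not begin until the preceding kernel in the null stream has finished, and the host thread does not move past the call until the data has landed in the host buffer.

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;              // illustrative kernel
}

int main() {
    const int n = 1 << 20;
    float *d_buf;
    float *h_out = (float *)malloc(n * sizeof(float));
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(d_buf, n);   // issued to the null stream

    // Synchronizing: this copy does not begin until scale() has completed.
    // Blocking: the host thread waits here until the data is in h_out.
    cudaMemcpy(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);

    // h_out is safe to read here -- no extra cudaDeviceSynchronize() needed.
    free(h_out);
    cudaFree(d_buf);
    return 0;
}
```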

In my view the difference you pointed out above isn’t important to grasp or remember.

When using cudaMemcpyAsync, I would always use pinned buffers. If you're not using pinned buffers, I see little reason to use cudaMemcpyAsync. If you do use pinned buffers, then it behaves just as you would expect from stream semantics: the operation is fully asynchronous both with respect to host and device, and will actually execute (i.e. transfer data) when stream semantics permit.
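A minimal sketch of that recommendation (buffer size and stream usage are arbitrary, for illustration only): allocate the host buffer with cudaMallocHost, issue the copy into a non-default stream, and synchronize only when the data is actually needed.

```cpp
#include <cstring>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, n * sizeof(float));   // pinned (page-locked) host memory
    cudaMalloc(&d_buf, n * sizeof(float));
    std::memset(h_buf, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Fully asynchronous with respect to the host: the call returns immediately,
    // and the transfer executes when stream semantics permit.
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // ... the host is free to do other work here, e.g. prepare the next chunk ...

    cudaStreamSynchronize(stream);   // wait only when the result is needed

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

This pattern is what enables overlapping transfers with kernel execution, since the host thread is released immediately and can keep issuing work to other streams.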

I recommend unit 7 of this online course (or perhaps all units up through unit 7) for further background.

I can't yet distinguish which information is critical and whether it affects my further CUDA learning. I am sorry if I wasted shared resources on trivial questions.

Based on the above learning, I want to make some conclusions.
Q1: For the synchronous behavior of cudaMemcpy, I should follow the documentation's literal statements and not get trapped in the details, at least for now, as a newbie.
Q2: For the asynchronous behavior of cudaMemcpyAsync: if I use pinned host memory, it is asynchronous with respect to the host; if I use pageable host memory, the behavior is uncertain. There is little reason to use pageable host memory with cudaMemcpyAsync.

Then, I need to learn more from the online course.

Thank you, Robert. You help me a lot.

Jack
