Confusion about the synchronous/asynchronous behavior of cudaMemcpy() and cudaMemcpyAsync()

Dear all,

I want to learn more about the details of cudaMemcpy() and cudaMemcpyAsync().
After reading the Memcpy section of the "API synchronization behavior" documentation, I have two questions.

Q1: In the Memcpy section of the documentation, it says

Synchronous

  1. For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.
  2. For transfers from pinned host memory to device memory, the function is synchronous with respect to the host.
    …

Does "the function is synchronous with respect to the host" mean the host will wait until the DMA transfer finishes? If it doesn't, can the function return to the host immediately because the host memory is already pinned?

Q2: And then in the Memcpy section of the documentation, it says

Asynchronous

  1. For transfers between device memory and pageable host memory, the function might be synchronous with respect to host.
    …

Does it mean "The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed", i.e. the same behavior as point one in the Synchronous section?

Any help will be appreciated.

Jack


Yes. Generally speaking, when you do a cudaMemcpy operation, that is blocking the CPU thread (i.e. synchronous with respect to the host) until the transfer operation has completed.

Yes, when you do a cudaMemcpyAsync, and one of the pointers you pass to it refers to pageable host memory, then the transfer may also be a blocking transfer with respect to the host thread, i.e. the same as the previous description.

With respect to this second question, you should note the use of the word “may”. That is a possibly important distinction. Even if you pass a pageable buffer, it is possible for the transfer to be asynchronous with respect to the host, or stated another way, it is possible for the host thread to be released even before the transfer is complete.
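To make the distinction concrete, here is a minimal sketch (buffer size and timing approach are illustrative, not from the original thread) that times how long the cudaMemcpyAsync call itself takes to return. With a pinned buffer the call typically returns almost immediately; with a pageable buffer it may block for much of the transfer.

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256 << 20;            // 256 MiB, large enough to measure
    float *d_buf, *h_pageable, *h_pinned;
    cudaMalloc(&d_buf, bytes);
    h_pageable = (float *)malloc(bytes);       // ordinary pageable allocation
    cudaMallocHost(&h_pinned, bytes);          // page-locked (pinned) allocation

    auto time_call = [&](float *src, const char *label) {
        auto t0 = std::chrono::steady_clock::now();
        cudaMemcpyAsync(d_buf, src, bytes, cudaMemcpyHostToDevice);
        auto t1 = std::chrono::steady_clock::now();  // time until the API call returns
        cudaDeviceSynchronize();                     // then wait for the transfer itself
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        std::printf("%s: cudaMemcpyAsync returned after %lld us\n", label, us);
    };

    time_call(h_pageable, "pageable");  // may block for most of the transfer
    time_call(h_pinned,   "pinned");    // typically returns almost immediately

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    return 0;
}
```

As noted above, the pageable case is only "may block": the exact behavior is not guaranteed, which is why the documentation hedges with "might be synchronous".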

Thank you!
I still have a little confusion about these two questions.
Q1:

Both copying from pageable host memory to the device and copying from pinned host memory to the device make the host wait, but they differ as shown below.

  • Copying from pageable host memory to device
    • The function will return once the pageable buffer has been copied to the staging memory for DMA, but the DMA to final destination may not have completed. (based on the documentation)
  • Copying from pinned host memory to device
The function will not return until DMA completes the transfer operation. (based on the documentation and the answer above)

Is my understanding correct?

Q2:

I know the asynchronous API can help us overlap GPU kernel execution with data transfers.
I am a newbie to CUDA. Should I always keep in mind which kinds of arguments will cause the host to synchronize? Could you give some tips on when I should deliberately pay attention to the synchronization behavior of an API?

Jack

Certainly if you quote the documentation, I wouldn't argue with that. I would question why this point would ever matter. I don't spend time trying to remember such things. There isn't anything you can do as a CUDA programmer that would allow you to meaningfully observe a difference here, other than measurements of elapsed time, I guess.

With respect to cudaMemcpy, it always has blocking and synchronizing behavior:

blocking: causes the host thread to wait there until the transfer is complete

synchronizing: does not begin until all previously issued work to the device is complete

cudaMemcpy, since it does not accept a stream argument, always has null stream semantics. A solid grasp of null stream semantics will help to cement the ideas I’ve stated already.
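A minimal sketch of those two properties (the kernel and sizes are made up for illustration): the cudaMemcpy below does not begin until the preceding kernel in the null stream has finished, and the host thread does not move past the call until the data has landed in the host buffer.

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;              // illustrative kernel
}

int main() {
    const int n = 1 << 20;
    float *d_buf;
    float *h_out = (float *)malloc(n * sizeof(float));
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(d_buf, n);   // issued to the null stream

    // Synchronizing: this copy does not begin until scale() has completed.
    // Blocking: the host thread waits here until the data is in h_out.
    cudaMemcpy(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);

    // h_out is safe to read here -- no extra cudaDeviceSynchronize() needed.
    free(h_out);
    cudaFree(d_buf);
    return 0;
}
```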

In my view the difference you pointed out above isn’t important to grasp or remember.

When using cudaMemcpyAsync, I would always use pinned buffers. If you're not using pinned buffers, I see little reason to use cudaMemcpyAsync. If you do use pinned buffers, then it behaves just as you would expect from stream semantics: the operation is fully asynchronous both with respect to host and device, and will actually execute (i.e. transfer data) when stream semantics permit.
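A minimal sketch of that recommendation (buffer size and stream usage are arbitrary, for illustration only): allocate the host buffer with cudaMallocHost, issue the copy into a non-default stream, and synchronize only when the data is actually needed.

```cpp
#include <cstring>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, n * sizeof(float));   // pinned (page-locked) host memory
    cudaMalloc(&d_buf, n * sizeof(float));
    std::memset(h_buf, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Fully asynchronous with respect to the host: the call returns immediately,
    // and the transfer executes when stream semantics permit.
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // ... the host is free to do other work here, e.g. prepare the next chunk ...

    cudaStreamSynchronize(stream);   // wait only when the result is needed

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

This pattern is what enables overlapping transfers with kernel execution, since the host thread is released immediately and can keep issuing work to other streams.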

I recommend unit 7 of this online course (or perhaps all units up through unit 7) for further background.

I can't yet distinguish which information is critical and whether it affects my further CUDA learning. I am sorry if I wasted shared resources on trivial questions.

Based on the above learning, I want to make some conclusions.
Q1: For the synchronous behavior of cudaMemcpy, I should follow the documentation's literal statements and not get trapped in the details, at least for now, as a newbie.
Q2: For the asynchronous behavior of cudaMemcpyAsync: if I use pinned host memory, it is asynchronous with respect to the host; if I use pageable host memory, the behavior is uncertain. There is little reason to use pageable host memory with cudaMemcpyAsync.

Then, I need to learn more from the online course.

Thank you, Robert. You help me a lot.

Jack
