cudaMemcpyAsync problem

hi:
I tried to profile my CUDA program with gperftools, and I found that cudaMemcpyAsync consumes nearly 44% of the total time.
The profile shows that cudaMemcpyAsync calls cudaGetExportTable, and nearly 99% of the time is spent in cudaGetExportTable.
I thought cudaMemcpyAsync would return immediately and do the copy work in the background.
I want to know why cudaMemcpyAsync takes so much time, and is there any way to improve it?

Thanks.

Not necessarily. The documentation covers the cases when it is not asynchronous.

Thanks for your reply. Can you show me the documentation for when cudaMemcpyAsync is not asynchronous?
This site doesn't tell me the cases when it is not asynchronous.

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79

I call cudaMemcpyAsync with a non-default stream.

OK, I found the cases when cudaMemcpyAsync is not asynchronous:
https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memcpy-async

Asynchronous

For transfers from device memory to pageable host memory, the function will return only once the copy has completed.

For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.

For all other transfers, the function is fully asynchronous. If pageable memory must first be staged to pinned memory, this will be handled asynchronously with a worker thread.

===========

In my case, I use cudaMemcpyAsync to copy host memory to device memory, so it should work asynchronously.

When I do a host->device copy with cudaMemcpyAsync on a non-default stream, and the host memory is pinned, the copy is fully asynchronous.
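For reference, a minimal sketch of that fully asynchronous case (pinned host buffer plus a non-default stream); the buffer size is arbitrary and error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;

    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);   // page-locked (pinned) host allocation
    cudaMalloc(&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Returns immediately; the copy engine performs the transfer in the background.
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);

    // ... launch kernels on `stream` or do other host work here ...

    cudaStreamSynchronize(stream);   // wait only when the data is actually needed

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

If the same code used malloc instead of cudaMallocHost, the host buffer would be pageable and the call could fall into the synchronous/staged case discussed above.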

For pageable host memory, is cudaMemcpyAsync synchronous? Supposedly the CUDA driver can use the CPU to copy the pageable host memory to a staging pinned buffer and then do an asynchronous copy with the GPU, so cudaMemcpyAsync could still be asynchronous even for pageable host memory. Am I right?

In principle, it is not synchronous. However, you may have implicit synchronisation when using streams, for example.

You can have a look at this section of the CUDA documentation:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#implicit-synchronization

In fact, the link you posted raises another question for me: why would the four actions below cause implicit synchronisation?

  • a page-locked host memory allocation,
  • a device memory allocation,
  • a device memory set,
  • a memory copy between two addresses to the same device memory,

It seems these memory operations fully synchronize two concurrent streams, but why would that happen? After all, they are just memory allocations or setting a value in a block of memory.

thanks

Hi @zhuguoyu29

Some of the points can be explained from an architectural point of view:

  1. a page-locked host memory allocation:

I think this is most relevant in cases with Unified Memory, where there must be coherency checks on both the CPU and the GPU.

  2. a device memory allocation

Same as before, but also imagine that many threads try to allocate 100 MB each in global memory. There must be a control that prevents memory overlaps, i.e. reserving the same memory for more than one thread.

  3. a device memory set

I am afraid I don't understand this point well enough.

  4. a memory copy between two addresses to the same device memory

Let's suppose you want to write to global memory. The write request passes through a memory controller to prevent multiple ports from touching the same memory address, and thus to avoid catastrophic logic conflicts (i.e. short circuits) in the memory. In the end, these memories are electronic circuits.

Perhaps, the question is: how can I avoid synchronisation?

  1. Allocate memory before processing: use host-side allocations, or try to exploit the threads during processing to minimise the footprint of allocation.

  2. Avoid access congestion: in principle, the best way for threads to interact with memory is the so-called "coalesced access", where you have a contiguous chunk of data and each thread is in charge of touching just one element of it.
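A minimal sketch of both suggestions together, assuming a simple elementwise kernel (the kernel and sizes are just illustrative): allocations are hoisted out of the loop, so calls such as cudaMalloc/cudaMallocHost, which can implicitly synchronise, run only once.

```cuda
#include <cuda_runtime.h>

__global__ void process(float *data, size_t n) {
    // Coalesced access: consecutive threads touch consecutive elements.
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const size_t n = 1 << 20, bytes = n * sizeof(float);

    // Allocate once, up front -- not inside the processing loop.
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);   // pinned, so the async copies below can overlap
    cudaMalloc(&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int iter = 0; iter < 10; ++iter) {
        // No allocations here, so no implicit synchronisation between streams.
        cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
        process<<<(unsigned)((n + 255) / 256), 256, 0, stream>>>(d_buf, n);
        cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
    }
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```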

Hope this helps.

Leon.

Thanks, @luis.leon. Appreciate it