cudaMemcpyAsync problem

hi:
I tried to profile my CUDA program with gperftools, and I found that cudaMemcpyAsync consumes nearly 44% of the total time.
The profile shows that cudaMemcpyAsync calls cudaGetExportTable, and nearly 99% of the time is spent in cudaGetExportTable.
I thought cudaMemcpyAsync would return immediately and do the copy work in the background.
I want to know why cudaMemcpyAsync takes so much time, and is there any way to improve it?

Thanks.

Not necessarily. The documentation covers the cases when it is not asynchronous.

Thanks for your reply. Can you show me the documentation for when cudaMemcpyAsync is not asynchronous?
This site doesn't tell me the cases when it is not asynchronous.

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79

I call cudaMemcpyAsync with a non-default stream.

OK, I found the cases when cudaMemcpyAsync is not asynchronous:
https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memcpy-async

Asynchronous

For transfers from device memory to pageable host memory, the function will return only once the copy has completed.

For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.

For all other transfers, the function is fully asynchronous. If pageable memory must first be staged to pinned memory, this will be handled asynchronously with a worker thread.

===========

In my case, I use cudaMemcpyAsync to copy host memory to device memory, so it should work asynchronously.

When I do a host->device copy with cudaMemcpyAsync on a non-default stream, and the host memory is pinned, the copy is fully asynchronous.
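For reference, a minimal sketch of that fully asynchronous case (pinned host buffer plus a non-default stream); the buffer size is arbitrary and error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;

    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);   // page-locked (pinned) host allocation
    cudaMalloc(&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Returns immediately; the copy engine performs the transfer in the background.
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);

    // ... launch kernels on `stream` or do other host work here ...

    cudaStreamSynchronize(stream);   // wait only when the data is actually needed

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

If the same code used malloc instead of cudaMallocHost, the host buffer would be pageable and the call could fall into the synchronous/staged case discussed above.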

For pageable host memory, is cudaMemcpyAsync synchronous? Supposedly the CUDA driver can use the CPU to copy the pageable host memory to a staging pinned buffer and then do an asynchronous copy with the GPU, so cudaMemcpyAsync could still be asynchronous even for pageable host memory. Am I right?

In principle, it is not synchronous. However, you may have implicit synchronisation when using streams, for example.

You can have a look at this section of the CUDA documentation:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#implicit-synchronization

In fact, the link you posted raises another question for me: why would the four actions below cause implicit synchronisation?

  • a page-locked host memory allocation,
  • a device memory allocation,
  • a device memory set,
  • a memory copy between two addresses to the same device memory,

It seems these memory operations fully synchronize two concurrent streams, but why would that happen? After all, they are just memory allocations or setting a value in a block of memory.

thanks

Hi @zhuguoyu29

Some of the points can be explained from an architectural point of view:

  1. a page-locked host memory allocation:

I think this is most relevant in cases with Unified Memory, where there must be coherency checks on both the CPU and the GPU.

  2. a device memory allocation

Same as before, but also imagine that many threads try to allocate 100 MB each in global memory. There must be a control that prevents memory overlaps, i.e. reserving the same memory for more than one thread.

  3. a device memory set

I am afraid I don't understand this point well enough.

  4. a memory copy between two addresses to the same device memory

Let's suppose you want to write to global memory. The write request passes through a memory controller to prevent multiple ports from touching the same memory address, and thus to avoid catastrophic logic conflicts (i.e. short circuits) in the memory. In the end, these memories are electronic circuits.

Perhaps, the question is: how can I avoid synchronisation?

  1. Allocate memory before processing: use host-side allocations, or try to exploit the threads during processing to minimise the footprint of allocation.

  2. Avoid access congestion: in principle, the best way for threads to interact with memory is the so-called "coalesced access", where you have a contiguous chunk of data and each thread is in charge of touching just one element of it.
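A minimal sketch of both suggestions together, assuming a simple elementwise kernel (the kernel and sizes are just illustrative): allocations are hoisted out of the loop, so calls such as cudaMalloc/cudaMallocHost, which can implicitly synchronise, run only once.

```cuda
#include <cuda_runtime.h>

__global__ void process(float *data, size_t n) {
    // Coalesced access: consecutive threads touch consecutive elements.
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const size_t n = 1 << 20, bytes = n * sizeof(float);

    // Allocate once, up front -- not inside the processing loop.
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);   // pinned, so the async copies below can overlap
    cudaMalloc(&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int iter = 0; iter < 10; ++iter) {
        // No allocations here, so no implicit synchronisation between streams.
        cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
        process<<<(unsigned)((n + 255) / 256), 256, 0, stream>>>(d_buf, n);
        cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
    }
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```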

Hope this helps.

Leon.

Thanks, @luis.leon. Appreciate it