Memcpy_async() to host memory

Can cooperative_groups::memcpy_async() be used to copy from/to host memory that was allocated with cudaMallocHost()?
I’ve read that the main purpose of this function is to copy between shared and global memory.
However, I thought it might work in other scenarios as well because of this:

while this is a memcpy in the general case, it is only asynchronous if …


• Returns immediately after initiating copy
• host can do work while copy is performed
• only if pinned memory is used
• Copies in the same direction (i.e. H2D or D2H) are
• copies in opposite directions are concurrent if in different

Pinned Memory:
Pinned (or page-locked) memory will not be paged out to disk when memory runs low:
• the GPU can safely remotely read/write the memory directly without host involvement
• only use for transfers, because it is easy to run out of memory

Yes, you can use pointers to pinned host memory with memcpy_async.
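To make that concrete, here is a minimal sketch rather than a tuned implementation. It assumes CUDA 11 or newer and a system with unified virtual addressing, so that a pointer returned by cudaMallocHost() can be dereferenced directly in a kernel. The kernel name stage_and_increment, the buffer names, and the size N are made up for the illustration.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cstdio>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

constexpr int N = 1024;  // illustrative element count (4 KB of shared memory)

// Hypothetical kernel: the block stages pinned host data in shared memory,
// then writes each element plus one back to a second pinned host buffer.
__global__ void stage_and_increment(const int* src, int* dst) {
    __shared__ int tile[N];
    cg::thread_block block = cg::this_thread_block();

    // The last argument is a size in bytes, not an element count.
    cg::memcpy_async(block, tile, src, sizeof(int) * N);
    cg::wait(block);  // wait for the copy and synchronize the group

    for (unsigned i = block.thread_rank(); i < N; i += blockDim.x)
        dst[i] = tile[i] + 1;
}

int main() {
    int* h_src = nullptr;
    int* h_dst = nullptr;
    // Assumption: with unified virtual addressing these host pointers are
    // valid device pointers as well.
    if (cudaMallocHost(&h_src, N * sizeof(int)) != cudaSuccess ||
        cudaMallocHost(&h_dst, N * sizeof(int)) != cudaSuccess) {
        std::printf("pinned allocation failed\n");
        return 1;
    }
    for (int i = 0; i < N; ++i) { h_src[i] = i; h_dst[i] = 0; }

    stage_and_increment<<<1, 128>>>(h_src, h_dst);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        std::printf("kernel failed\n");
        return 1;
    }

    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (h_dst[i] != i + 1) ++mismatches;
    std::printf("mismatches: %d\n", mismatches);

    cudaFreeHost(h_src);
    cudaFreeHost(h_dst);
    return mismatches == 0 ? 0 : 1;
}
```

Whether the copy actually runs asynchronously depends on the hardware and on the conditions in the documentation quoted above. The sketch only shows that the pointers are accepted.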

Is memcpy_async() a good option in terms of performance?
Does the number of threads in the cooperative group affect the performance of memcpy_async()?
I don’t care about asynchronous behavior (and will call sync() right after memcpy_async()).

  • Let's say that I have a group of 32 threads (a warp) that needs to copy a large array (1024 integers). Will memcpy_async() have similar performance to a regular copy in which the threads loop over the array concurrently?

  • Now, let's say that there is only one active thread (coalesced_threads().size() == 1). Is there a benefit in using memcpy_async() over a regular copy using a loop?

Can you please share some insight about how memcpy_async() works internally?


The implementation of cg::memcpy_async is available in include/cooperative_groups/details/async.h

I would suggest simply implementing both versions, one without memcpy_async and one with it, and picking the faster one.
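A rough sketch of such a comparison for the warp case from the question could look like the following. It assumes CUDA 11 or newer and unified virtual addressing. The kernel names, the helper time_kernel, and the repetition count are invented for the example. The destination is pinned host memory and the source is device global memory, so memcpy_async is expected to take its general, non-asynchronous path here.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cstdio>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

constexpr int N = 1024;  // number of ints, as in the question

// Variant A: one warp copies with cg::memcpy_async and waits right away.
__global__ void copy_with_memcpy_async(int* dst, const int* src) {
    cg::thread_block block = cg::this_thread_block();
    auto warp = cg::tiled_partition<32>(block);
    cg::memcpy_async(warp, dst, src, sizeof(int) * N);  // size in bytes
    cg::wait(warp);
}

// Variant B: the same warp copies with a plain strided loop.
__global__ void copy_with_loop(int* dst, const int* src) {
    cg::thread_block block = cg::this_thread_block();
    auto warp = cg::tiled_partition<32>(block);
    for (unsigned i = warp.thread_rank(); i < N; i += warp.size())
        dst[i] = src[i];
}

// Hypothetical helper: average time per launch in milliseconds.
// For a copy this small, launch overhead is part of what gets measured.
float time_kernel(void (*kernel)(int*, const int*), int* dst,
                  const int* src, int reps) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    kernel<<<1, 32>>>(dst, src);  // warm-up launch, not timed
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) kernel<<<1, 32>>>(dst, src);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

bool matches(const int* h) {
    for (int i = 0; i < N; ++i)
        if (h[i] != i) return false;
    return true;
}

int main() {
    int* h_dst = nullptr;  // pinned host destination
    int* d_src = nullptr;  // device global source
    if (cudaMallocHost(&h_dst, N * sizeof(int)) != cudaSuccess ||
        cudaMalloc(&d_src, N * sizeof(int)) != cudaSuccess) {
        std::printf("allocation failed\n");
        return 1;
    }
    // Use the pinned buffer as staging to initialise the source, then clear it.
    for (int i = 0; i < N; ++i) h_dst[i] = i;
    cudaMemcpy(d_src, h_dst, N * sizeof(int), cudaMemcpyHostToDevice);

    for (int i = 0; i < N; ++i) h_dst[i] = 0;
    float ms_async = time_kernel(copy_with_memcpy_async, h_dst, d_src, 100);
    bool ok_async = matches(h_dst);

    for (int i = 0; i < N; ++i) h_dst[i] = 0;
    float ms_loop = time_kernel(copy_with_loop, h_dst, d_src, 100);
    bool ok_loop = matches(h_dst);

    std::printf("memcpy_async: %f ms per launch, correct: %d\n", ms_async, ok_async);
    std::printf("plain loop:   %f ms per launch, correct: %d\n", ms_loop, ok_loop);

    cudaFree(d_src);
    cudaFreeHost(h_dst);
    return (ok_async && ok_loop) ? 0 : 1;
}
```

The numbers will vary with the GPU, the driver, and the interconnect, so treat this only as a starting point for measuring your own access pattern. The single-thread case can be tested the same way by launching with one thread and using cg::coalesced_threads() as the group.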