CPU blocked MUCH longer than expected calling a cudaMemcpy after a cuda graph launch

Hi, I’m facing an issue on CPU thread when I call this two functions sequentially:
cudaGraphLaunch(…, streamX);
cudaMemcpy2DAsync(…, streamX);

the first call is blocking the CPU for 15/20us as expected (there is a NVIDIA presentation showing similar numbers), the second one is blocking the CPU for ~200us which is extremely higher than expected, it’s an ASYNC function which is actually not performing any data transfer (I’m just asking the GPU to transfer data when CUDA graph is completed)

If I put a usleep(50) between those calls, the waiting time for cudaMemcpy reduces by 50/100us.
My theory is that CUDA driver is “blocking” all cuda requests after the cudaGraphLaunch for many microseconds, am I right? If confirmed, this is a quite concerning issue because we cannot be sure about the timings of any cuda method call from CPU side. Did I miss something on the documentation about this topic?

(p.s. this is not related to Memcpy2D, I have the same problem with 1D memcpy)


cudaMemcpy*Async is a blocking operation when the source or destination is pageable host memory. See CUDA Driver API :: CUDA Toolkit Documentation

I would check this first, and if it is pinned memory then take a look at the profiler timeline of nsight systems.

1 Like

do you mean a memcpyAsync is sometimes Async and sometimes Sync depending on the arguments?
I’m in this situation:
" For transfers from device to either pageable or pinned host memory, the function returns only once the copy has completed."

This is a major issue on the API, it should be pointed out much clearly, there should be a warning, a runtime check… it’s not possible to have an “Async” function which is not Async !


That quote you excerpted is from the Synchronous section. So unless you are actually issuing cudaMemcpy instead of cudaMemcpyAsync, that quote (and section) do not apply.

The one that may apply if you are executing cudaMemcpyAsync is:

  1. For transfers between device memory and pageable host memory, the function might be synchronous with respect to host.

Thank you for the clarification, anyway all those “might” and “should” are worrying me a little.

What if I’m copying from device memory to pinned host memory? it’s below point 4 " For all other transfers, the function should be fully asynchronous." ? Is there a way to check for that “should”?

What about HOST => DEVICE async copy, is it always async?

Thank you

In general any transfer between pinned host memory, and device memory, using cudaMemcpyAsync with a properly created stream, should be fully asynchronous:

  • will obey stream semantics (i.e. asynchronous with respect to other streams)
  • does not block the host CPU thread (does not cause the host CPU thread to wait for the transfer to finish)

This is true for either direction (host to device, or device to host).

If you instead use pageable memory, all bets are off. It could be both synchronizing and blocking. It will still obey stream semantics, in a narrow sense, but may not run asynchronously with respect to other stream activity, like you might expect.

Not sure what you mean exactly. Use a profiler, I guess.

There are numerous forum questions both on this forum and others, that cover this topic. There is organized training available (session 7, CUDA Concurrency). And the limitations are mentioned in multiple places in the documentation.

1 Like

I’m using memcpyAsync from GPU to HOST, with pinned memory on both sides.

I fixed my problem by using cudaHostRegister() on the pinned CPU memory because it was allocated by another process and (I suppose!) CUDA requires HostRegister to know that is pinned.

Now transfer is twice faster and memcpy is async!
Thanks for your help

p.s. I think memcpyAsync should return an error when it’s not running asynchronously. I suppose it’s not possible but maybe you can add a new optional flag to memcpyAsync to explicitly ask for an error when not running async.

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.