Hi, I’m facing an issue on the CPU thread when I call these two functions sequentially: cudaGraphLaunch followed by cudaMemcpyAsync.
The first call blocks the CPU for 15–20 µs as expected (there is an NVIDIA presentation showing similar numbers), but the second blocks the CPU for ~200 µs, which is far higher than expected. It’s an ASYNC function which at that point is not actually performing any data transfer (I’m just asking the GPU to transfer the data once the CUDA graph has completed).
If I put a usleep(50) between those calls, the time cudaMemcpyAsync blocks drops by 50–100 µs.
My theory is that the CUDA driver is “blocking” all CUDA requests for many microseconds after the cudaGraphLaunch; am I right? If confirmed, this is quite a concerning issue, because it means we cannot be sure about the CPU-side timing of any CUDA API call. Did I miss something in the documentation about this topic?
(p.s. this is not related to cudaMemcpy2D; I have the same problem with the 1D memcpy)
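A minimal sketch of the pattern described above, timing how long each call blocks the CPU thread (the names graphExec, stream, h_buf and d_buf are placeholders, not from the original post, and are assumed to have been set up earlier):

```cpp
// Hypothetical reproduction sketch: measure CPU-side blocking time of
// cudaGraphLaunch followed immediately by cudaMemcpyAsync.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

void measure(cudaGraphExec_t graphExec, cudaStream_t stream,
             void* h_buf, const void* d_buf, size_t bytes)
{
    using clk = std::chrono::steady_clock;
    auto us = [](clk::time_point a, clk::time_point b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };

    auto t0 = clk::now();
    cudaGraphLaunch(graphExec, stream);               // expected: ~15-20 us on the CPU
    auto t1 = clk::now();
    cudaMemcpyAsync(h_buf, d_buf, bytes,
                    cudaMemcpyDeviceToHost, stream);  // observed: ~200 us on the CPU
    auto t2 = clk::now();

    printf("cudaGraphLaunch  blocked CPU for %lld us\n", (long long)us(t0, t1));
    printf("cudaMemcpyAsync  blocked CPU for %lld us\n", (long long)us(t1, t2));
}
```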
cudaMemcpy*Async can be a blocking operation when the source or destination is pageable host memory. See CUDA Driver API :: CUDA Toolkit Documentation
I would check that first, and if the memory is in fact pinned, then take a look at the profiler timeline in Nsight Systems.
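One way to check whether a host pointer is actually pinned as far as the CUDA runtime is concerned is cudaPointerGetAttributes; a sketch, assuming a CUDA 11+ runtime:

```cpp
// Sketch: query whether a host pointer is known to CUDA as pinned memory.
// On CUDA 11+, an ordinary malloc'd pointer reports cudaMemoryTypeUnregistered,
// while memory from cudaMallocHost / cudaHostRegister reports cudaMemoryTypeHost.
#include <cuda_runtime.h>

bool isPinned(const void* p)
{
    cudaPointerAttributes attr{};
    cudaError_t err = cudaPointerGetAttributes(&attr, p);
    if (err != cudaSuccess) {
        cudaGetLastError();  // clear the sticky error state
        return false;
    }
    return attr.type == cudaMemoryTypeHost;  // registered/pinned host memory
}
```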
Do you mean cudaMemcpyAsync is sometimes asynchronous and sometimes synchronous, depending on the arguments?
I’m in this situation:
" For transfers from device to either pageable or pinned host memory, the function returns only once the copy has completed."
This is a major issue in the API; it should be pointed out much more clearly. There should be a warning, a runtime check… it’s not acceptable to have an “Async” function which is not async!
That quote you excerpted is from the Synchronous section. So unless you are actually issuing cudaMemcpy instead of cudaMemcpyAsync, that quote (and section) do not apply.
The one that may apply if you are executing cudaMemcpyAsync is:
- For transfers between device memory and pageable host memory, the function might be synchronous with respect to host.
Thank you for the clarification; anyway, all those “might”s and “should”s worry me a little.
What if I’m copying from device memory to pinned host memory? Does that fall under point 4, “For all other transfers, the function should be fully asynchronous.”? Is there a way to verify that “should”?
What about HOST => DEVICE async copies: are they always async?
In general, any transfer between pinned host memory and device memory, using cudaMemcpyAsync with a properly created stream, should be fully asynchronous:
- it will obey stream semantics (i.e. be asynchronous with respect to other streams)
- it does not block the host CPU thread (does not cause the host CPU thread to wait for the transfer to finish)
This is true for either direction (host to device, or device to host).
If you instead use pageable memory, all bets are off. The call could be both synchronizing and blocking. It will still obey stream semantics in a narrow sense, but it may not run asynchronously with respect to other stream activity, as you might expect.
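A sketch contrasting the two cases (buffer sizes and names are illustrative; actual timings vary by system):

```cpp
// Sketch: compare how long cudaMemcpyAsync blocks the CPU thread for
// pageable vs pinned host memory. With pageable memory the call may not
// return until the copy is staged or complete; with pinned memory it
// should return almost immediately.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64 << 20;  // 64 MiB
    void* d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);

    void* pageable = malloc(bytes);  // ordinary pageable allocation
    void* pinned = nullptr;
    cudaMallocHost(&pinned, bytes);  // page-locked (pinned) allocation

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    auto timeCopy = [&](void* dst, const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        cudaMemcpyAsync(dst, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
        auto t1 = std::chrono::steady_clock::now();
        cudaStreamSynchronize(stream);  // wait for the copy itself to finish
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        printf("%s: call blocked CPU for %lld us\n", label, us);
    };

    timeCopy(pageable, "pageable");  // typically blocks until the copy completes
    timeCopy(pinned, "pinned");      // typically returns within a few microseconds

    cudaFreeHost(pinned);
    free(pageable);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    return 0;
}
```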
Not sure what you mean exactly. Use a profiler, I guess.
There are numerous forum questions, both on this forum and others, that cover this topic. There is organized training available (session 7, CUDA Concurrency), and the limitations are mentioned in multiple places in the documentation.
I’m using cudaMemcpyAsync from GPU to HOST, with pinned memory on both sides.
I fixed my problem by calling cudaHostRegister() on the pinned CPU memory: it was allocated by another process, and (I suppose!) CUDA requires cudaHostRegister() to know that it is pinned.
Now the transfer is twice as fast and the memcpy is async!
Thanks for your help
p.s. I think cudaMemcpyAsync should return an error when it’s not running asynchronously. I suppose that’s not possible, but maybe you could add a new optional flag to cudaMemcpyAsync to explicitly request an error when the call cannot run asynchronously.
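The fix described above can be sketched roughly like this (the shared-memory setup from the other process is assumed and omitted; only the cudaHostRegister call is the relevant part):

```cpp
// Sketch: make a buffer allocated outside this process (e.g. a shared-memory
// mapping) visible to CUDA as pinned memory, so cudaMemcpyAsync into it can
// run fully asynchronously in this process too.
#include <cuda_runtime.h>

void registerExternalBuffer(void* buf, size_t bytes)
{
    // Tell the CUDA driver to page-lock this range and treat it as pinned
    // host memory for this process.
    cudaHostRegister(buf, bytes, cudaHostRegisterDefault);

    // ... use buf as the destination of cudaMemcpyAsync ...

    // When done, undo the registration before unmapping the memory.
    cudaHostUnregister(buf);
}
```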