Question: the host is blocked when using cudaMemcpyAsync()

xiaodongyee · April 5, 2018, 2:51pm

Dear all, I have run into a problem while using multiple streams in different threads (in the same process).

When I measure the time taken by cudaMemcpyAsync(), I find that the host is sometimes blocked inside this function. Why does this happen, and how can I avoid it?

Thanks a lot.

Robert_Crovella · April 5, 2018, 3:28pm

cudaMemcpyAsync will be blocking if it is doing a H->D or D->H transfer, and the corresponding H allocation is not a pinned allocation.

There may be other reasons as well:

CUDA Runtime API :: CUDA Toolkit Documentation
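To illustrate the pinned versus pageable point, here is a minimal sketch that times how long the host spends inside a single cudaMemcpyAsync call for each kind of host buffer. The 64 MiB size, the `enqueue_ms` helper and the `CHECK` macro are illustrative assumptions, not something from the posts above, and the numbers will vary by system.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Host wall-clock time (ms) spent inside one H->D cudaMemcpyAsync call.
static double enqueue_ms(void* dst, const void* src, size_t bytes,
                         cudaStream_t stream) {
    auto t0 = std::chrono::steady_clock::now();
    CHECK(cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, stream));
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB, an arbitrary test size
    void* dev = nullptr;
    void* pinned = nullptr;
    void* pageable = malloc(bytes);  // ordinary (pageable) host memory
    if (pageable == nullptr) return EXIT_FAILURE;

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMalloc(&dev, bytes));
    CHECK(cudaMallocHost(&pinned, bytes));  // pinned (page-locked) host memory
    memset(pageable, 0, bytes);
    memset(pinned, 0, bytes);

    // Warm-up copy so one-time setup costs are not attributed to either case.
    CHECK(cudaMemcpyAsync(dev, pinned, bytes, cudaMemcpyHostToDevice, stream));
    CHECK(cudaStreamSynchronize(stream));

    double t_pageable = enqueue_ms(dev, pageable, bytes, stream);
    CHECK(cudaStreamSynchronize(stream));
    double t_pinned = enqueue_ms(dev, pinned, bytes, stream);
    CHECK(cudaStreamSynchronize(stream));

    printf("host time in call, pageable source: %.3f ms\n", t_pageable);
    printf("host time in call, pinned source:   %.3f ms\n", t_pinned);

    CHECK(cudaFreeHost(pinned));
    CHECK(cudaFree(dev));
    CHECK(cudaStreamDestroy(stream));
    free(pageable);
    return 0;
}
```

If the pageable case takes roughly as long as the transfer itself while the pinned case returns almost immediately, that matches the behavior described above.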

xiaodongyee · April 5, 2018, 3:37pm

Dear txbob,
Thanks for your quick reply. In every thread where I use cudaMemcpyAsync, the host memory is pinned, but the host is still blocked.

I also read some material elsewhere that said the following:

cudaMemcpyAsync (host2device,stream1) 1
cudaMemcpyAsync (host2device,stream1) 2
kernel<<<…,…,stream1>>> 3
cudaMemcpyAsync (device2host,stream1) 4

cudaMemcpyAsync (host2device,stream2) 5
cudaMemcpyAsync (host2device,stream2) 6
kernel<<<…,…,stream2>>> 7
cudaMemcpyAsync (device2host,stream2) 8

It said that operation 5 will be blocked because operation 3 has not completed. Is that true?

Robert_Crovella · April 5, 2018, 3:57pm

Not that I know of. It looks like a depth-first launch strategy to me. In general I don’t know of any problems with that strategy on Kepler or newer GPUs.

If you believe that claim, it should be trivial to put together a simple test case and prove or disprove it.

Also, if you are running on Windows WDDM, that may be the culprit, depending on what exactly you are looking at and how you are determining that the host is blocked on a particular call.
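A simple test case along those lines might look like the following sketch. It issues the eight operations from the question depth-first into two streams and prints how long the host spends inside each call. The `spin` kernel, the `host_ms` helper, the buffer size and the cycle count are all assumptions made for illustration; substitute the real kernel and sizes.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Stand-in kernel: busy-waits for a number of clock cycles so that it is
// still running while the second stream's work is being enqueued.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Host wall-clock time (ms) spent inside the callable f.
template <typename F>
static double host_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t bytes = 32u << 20;          // 32 MiB per buffer (arbitrary)
    const long long cycles = 200000000LL;    // kernel duration knob (arbitrary)
    cudaStream_t st[2];
    float *hA[2], *hB[2], *hOut[2], *dA[2], *dB[2], *dOut[2];
    for (int s = 0; s < 2; ++s) {
        CHECK(cudaStreamCreate(&st[s]));
        CHECK(cudaMallocHost((void**)&hA[s], bytes));    // pinned host buffers
        CHECK(cudaMallocHost((void**)&hB[s], bytes));
        CHECK(cudaMallocHost((void**)&hOut[s], bytes));
        CHECK(cudaMalloc((void**)&dA[s], bytes));
        CHECK(cudaMalloc((void**)&dB[s], bytes));
        CHECK(cudaMalloc((void**)&dOut[s], bytes));
    }
    spin<<<1, 1>>>(1);                       // warm-up launch
    CHECK(cudaDeviceSynchronize());

    int op = 1;                              // matches the numbering 1..8
    for (int s = 0; s < 2; ++s) {            // depth-first: all of stream s
        double t;
        t = host_ms([&] { CHECK(cudaMemcpyAsync(dA[s], hA[s], bytes,
                                cudaMemcpyHostToDevice, st[s])); });
        printf("op %d (H2D,    stream%d): %.3f ms on host\n", op++, s + 1, t);
        t = host_ms([&] { CHECK(cudaMemcpyAsync(dB[s], hB[s], bytes,
                                cudaMemcpyHostToDevice, st[s])); });
        printf("op %d (H2D,    stream%d): %.3f ms on host\n", op++, s + 1, t);
        t = host_ms([&] { spin<<<1, 1, 0, st[s]>>>(cycles); });
        CHECK(cudaGetLastError());
        printf("op %d (kernel, stream%d): %.3f ms on host\n", op++, s + 1, t);
        t = host_ms([&] { CHECK(cudaMemcpyAsync(hOut[s], dOut[s], bytes,
                                cudaMemcpyDeviceToHost, st[s])); });
        printf("op %d (D2H,    stream%d): %.3f ms on host\n", op++, s + 1, t);
    }
    CHECK(cudaDeviceSynchronize());

    for (int s = 0; s < 2; ++s) {
        CHECK(cudaFreeHost(hA[s])); CHECK(cudaFreeHost(hB[s]));
        CHECK(cudaFreeHost(hOut[s]));
        CHECK(cudaFree(dA[s])); CHECK(cudaFree(dB[s])); CHECK(cudaFree(dOut[s]));
        CHECK(cudaStreamDestroy(st[s]));
    }
    return 0;
}
```

If op 5 really waited for op 3, its host time would be on the order of the kernel duration. A profiler timeline is a more reliable way to confirm the overlap than host timers alone.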

For complex concurrency scenarios, I never recommend trying to test for proper overlap behavior in a WDDM setup. The WDDM system gets in the way, in many cases.

xiaodongyee · April 5, 2018, 4:32pm

Dear txbob,

Thanks for your explanation.
In my project, I use Ubuntu 16.06 and a GTX 1080 Ti.

I have multiple threads that concurrently copy data to the device, launch a kernel, and copy data back, using different streams.

All host memory copied to the device is pinned.

However, the memcpy operation is sometimes blocked, and the wait is long enough that I really want to avoid it.

I also find that when I allocate with cudaMallocHost, the copy blocks for longer than when I call malloc first and then pin the memory with cudaHostRegister(). Why does this happen? Is it expected?

Thanks again for your reply.

Best Regards

Robert_Crovella · April 5, 2018, 5:21pm

Is the GTX 1080Ti driving a display?
If so, although the scenario is not the same as WDDM command batching, it’s possible that display tasks are being inserted into your computation timeline and disrupting the pattern you expect to see.

Other than that, I’m not aware of what differences there may be between malloc/cudaHostRegister, and cudaHostAlloc. I would expect them to behave similarly.
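One way to check whether the two pinning methods really differ is to time the same copy from both kinds of buffer, as in the sketch below. The `report` helper, the `CHECK` macro, the iteration count and the buffer size are assumptions for illustration only. It uses cudaMallocHost for one buffer and malloc followed by cudaHostRegister for the other, as described in the question.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Repeats an H->D copy and prints the mean and worst host time spent
// inside cudaMemcpyAsync itself (the stream sync is outside the timer).
static void report(const char* label, void* dev, const void* host,
                   size_t bytes, cudaStream_t stream) {
    const int iters = 20;
    double total = 0.0, worst = 0.0;
    for (int i = 0; i < iters; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        CHECK(cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream));
        auto t1 = std::chrono::steady_clock::now();
        CHECK(cudaStreamSynchronize(stream));
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        total += ms;
        if (ms > worst) worst = ms;
    }
    printf("%-28s mean %.4f ms, worst %.4f ms\n", label, total / iters, worst);
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB, an arbitrary test size
    void* dev = nullptr;
    void* allocated = nullptr;       // pinned at allocation time
    void* registered = malloc(bytes);  // pinned after the fact
    if (registered == nullptr) return EXIT_FAILURE;

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMalloc(&dev, bytes));
    CHECK(cudaMallocHost(&allocated, bytes));
    CHECK(cudaHostRegister(registered, bytes, cudaHostRegisterDefault));
    memset(allocated, 0, bytes);
    memset(registered, 0, bytes);

    // Warm-up copy so one-time setup costs do not skew the first measurement.
    CHECK(cudaMemcpyAsync(dev, allocated, bytes, cudaMemcpyHostToDevice, stream));
    CHECK(cudaStreamSynchronize(stream));

    report("cudaMallocHost:", dev, allocated, bytes, stream);
    report("malloc + cudaHostRegister:", dev, registered, bytes, stream);

    CHECK(cudaHostUnregister(registered));
    free(registered);
    CHECK(cudaFreeHost(allocated));
    CHECK(cudaFree(dev));
    CHECK(cudaStreamDestroy(stream));
    return 0;
}
```

This single-threaded version only gives a baseline. The multi-threaded case from the question would need the same measurement taken inside each worker thread.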

xiaodongyee · April 5, 2018, 11:41pm

Dear txbob,
My project is running on an Ubuntu server, which doesn’t drive a display.

Best Regards
