Host is blocked when calling cudaMemcpyAsync()

Dear all, I have run into a problem when using multiple streams from different threads (in the same process).

When I measure how long cudaMemcpyAsync() takes, I find that the host is sometimes blocked in this function. Why does this happen, and how can I avoid it?

Thanks a lot.

cudaMemcpyAsync will block if it is doing a host-to-device (H->D) or device-to-host (D->H) transfer and the corresponding host allocation is not pinned.

There may be other reasons as well:

http://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior
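
One way to confirm that a buffer really is pinned is to ask the runtime about the pointer. The following is only a minimal sketch, assuming CUDA 10 or newer (where cudaPointerAttributes has a `type` field); `isPinnedHost` is a hypothetical helper name, not part of the CUDA API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: returns true if the runtime reports ptr as
// page-locked (pinned) host memory.
static bool isPinnedHost(const void* ptr) {
    cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
    if (err != cudaSuccess) {
        // Older runtimes return an error for plain malloc() memory;
        // clear the error state and treat the pointer as not pinned.
        cudaGetLastError();
        return false;
    }
    return attr.type == cudaMemoryTypeHost;
}

int main() {
    const size_t bytes = 1 << 20;

    void* pageable = malloc(bytes);   // ordinary pageable allocation
    void* pinned = nullptr;
    if (pageable == nullptr || cudaMallocHost(&pinned, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    printf("malloc buffer pinned:         %s\n", isPinnedHost(pageable) ? "yes" : "no");
    printf("cudaMallocHost buffer pinned: %s\n", isPinnedHost(pinned) ? "yes" : "no");

    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}
```

Calling a check like this on the exact pointer passed to cudaMemcpyAsync (including any offset into a larger buffer) can rule out the case where only part of the range is pinned.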

Dear txbob,
Thanks for your quick reply. In all the threads where I call cudaMemcpyAsync, the host memory is pinned, but the host is still blocked.

I also read some material elsewhere that said the following:

cudaMemcpyAsync (host2device, stream1)    // 1
cudaMemcpyAsync (host2device, stream1)    // 2
kernel<<<…,…,stream1>>>                   // 3
cudaMemcpyAsync (device2host, stream1)    // 4

cudaMemcpyAsync (host2device, stream2)    // 5
cudaMemcpyAsync (host2device, stream2)    // 6
kernel<<<…,…,stream2>>>                   // 7
cudaMemcpyAsync (device2host, stream2)    // 8

It said that operation 5 will be blocked because operation 3 has not completed. Is that true?

Not that I know of. It looks like a depth-first launch strategy to me. In general I don’t know of any problems with that strategy on Kepler or newer GPUs.

If you believe that claim, it should be trivial to put together a simple test case and prove or disprove it.
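
A test case along these lines might look like the sketch below. It issues the same depth-first sequence on two streams and prints how long the host spends inside each call. Everything here is illustrative: `busyKernel`, `timed`, the buffer names and the sizes are made up for the example, and the kernel is just a placeholder that keeps the GPU busy for a while.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e_ = (call);                                        \
        if (e_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(e_)); \
            exit(1);                                                    \
        }                                                               \
    } while (0)

// Placeholder kernel: combines two inputs with enough arithmetic to run for a while.
__global__ void busyKernel(const float* a, const float* b, float* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = a[i] + b[i];
    for (int k = 0; k < 2000; ++k) v = v * 1.0001f + 0.5f;
    out[i] = v;
}

// Runs `call` and prints how long the host spent inside it.
template <typename F>
static void timed(const char* label, F&& call) {
    auto t0 = std::chrono::steady_clock::now();
    call();
    auto t1 = std::chrono::steady_clock::now();
    printf("  %-14s %10.1f us on host\n", label,
           std::chrono::duration<double, std::micro>(t1 - t0).count());
}

int main() {
    const size_t n = 1 << 22;
    const size_t bytes = n * sizeof(float);
    const int kStreams = 2;
    const unsigned threads = 256;
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);

    cudaStream_t st[kStreams];
    float *hA[kStreams], *hB[kStreams], *hOut[kStreams];
    float *dA[kStreams], *dB[kStreams], *dOut[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaStreamCreate(&st[s]));
        CHECK(cudaMallocHost((void**)&hA[s], bytes));    // pinned host buffers
        CHECK(cudaMallocHost((void**)&hB[s], bytes));
        CHECK(cudaMallocHost((void**)&hOut[s], bytes));
        CHECK(cudaMalloc((void**)&dA[s], bytes));
        CHECK(cudaMalloc((void**)&dB[s], bytes));
        CHECK(cudaMalloc((void**)&dOut[s], bytes));
        for (size_t i = 0; i < n; ++i) { hA[s][i] = 1.0f; hB[s][i] = 2.0f; }
    }

    // Warm-up launch with n = 0 so one-time setup cost is not attributed to the timed calls.
    busyKernel<<<1, threads>>>(dA[0], dB[0], dOut[0], 0);
    CHECK(cudaDeviceSynchronize());

    // Depth-first issue order: all work for stream 0, then all work for stream 1.
    for (int s = 0; s < kStreams; ++s) {
        printf("stream %d\n", s);
        timed("H2D copy A", [&] {
            CHECK(cudaMemcpyAsync(dA[s], hA[s], bytes, cudaMemcpyHostToDevice, st[s]));
        });
        timed("H2D copy B", [&] {
            CHECK(cudaMemcpyAsync(dB[s], hB[s], bytes, cudaMemcpyHostToDevice, st[s]));
        });
        timed("kernel launch", [&] {
            busyKernel<<<blocks, threads, 0, st[s]>>>(dA[s], dB[s], dOut[s], n);
        });
        timed("D2H copy", [&] {
            CHECK(cudaMemcpyAsync(hOut[s], dOut[s], bytes, cudaMemcpyDeviceToHost, st[s]));
        });
    }
    CHECK(cudaDeviceSynchronize());

    for (int s = 0; s < kStreams; ++s) {
        cudaFreeHost(hA[s]); cudaFreeHost(hB[s]); cudaFreeHost(hOut[s]);
        cudaFree(dA[s]); cudaFree(dB[s]); cudaFree(dOut[s]);
        cudaStreamDestroy(st[s]);
    }
    return 0;
}
```

If the first H2D copy on the second stream reports a host time comparable to the kernel's run time, that would support the claim; if every call returns in a small, roughly constant time, it would not. A profiler timeline is a useful cross-check on either result.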

Also, if you are running on Windows with the WDDM driver model, that may be the culprit, depending on what exactly you are looking at and how you are determining that the host is blocked on a particular call.

For complex concurrency scenarios, I never recommend trying to test for proper overlap behavior in a WDDM setup. The WDDM system gets in the way, in many cases.

Dear txbob,

Thanks for your explanation.
In my project, I use Ubuntu 16.06 and a GTX 1080 Ti.

I have multiple threads that concurrently copy data to the device, launch a kernel, and copy data back, using different streams.

All host memory copied to the device is pinned.

However, the memcpy operation is sometimes blocked, and the wait is rather long, which I really want to avoid.

I also find that when I just use cudaMallocHost(), the copy blocks for longer than when I first call malloc() and then use cudaHostRegister() to pin the memory. Why does this happen? Is it expected?

Thanks again for your reply.

Best Regards

Is the GTX 1080 Ti driving a display?
If so, although the scenario is not the same as WDDM command batching, it’s possible that display tasks are being inserted into your computation timeline and disrupting the pattern you expect to see.

Other than that, I’m not aware of what differences there may be between malloc/cudaHostRegister and cudaHostAlloc. I would expect them to behave similarly.
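
To compare the two pinning approaches in isolation, something like the following sketch could be used. It pins one buffer with cudaMallocHost and another with malloc plus cudaHostRegister, then measures only the host time spent inside cudaMemcpyAsync for each. The helper name `avgIssueTimeUs`, the buffer size, and the iteration count are arbitrary choices for illustration, and a single-threaded test like this may not reproduce what a multi-threaded application sees.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e_ = (call);                                        \
        if (e_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(e_)); \
            exit(1);                                                    \
        }                                                               \
    } while (0)

// Hypothetical helper: average host-side duration (microseconds) of the
// cudaMemcpyAsync call itself, not of the transfer it enqueues.
static double avgIssueTimeUs(void* dDst, const void* hSrc, size_t bytes,
                             cudaStream_t s, int iters) {
    double total = 0.0;
    for (int i = 0; i < iters; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        CHECK(cudaMemcpyAsync(dDst, hSrc, bytes, cudaMemcpyHostToDevice, s));
        auto t1 = std::chrono::steady_clock::now();
        total += std::chrono::duration<double, std::micro>(t1 - t0).count();
        CHECK(cudaStreamSynchronize(s));  // drain so each iteration starts on an idle stream
    }
    return total / iters;
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB, arbitrary
    const int iters = 20;

    // Approach 1: let the runtime allocate pinned memory.
    void* hAlloc = nullptr;
    CHECK(cudaMallocHost(&hAlloc, bytes));
    memset(hAlloc, 1, bytes);

    // Approach 2: allocate with malloc(), then pin the existing range.
    void* hReg = malloc(bytes);
    if (hReg == nullptr) { fprintf(stderr, "malloc failed\n"); return 1; }
    memset(hReg, 1, bytes);
    CHECK(cudaHostRegister(hReg, bytes, cudaHostRegisterDefault));

    void* dBuf = nullptr;
    CHECK(cudaMalloc(&dBuf, bytes));
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Warm-up copies so one-time setup cost is not counted.
    avgIssueTimeUs(dBuf, hAlloc, bytes, stream, 2);
    avgIssueTimeUs(dBuf, hReg, bytes, stream, 2);

    printf("cudaMallocHost:            %.1f us per call\n",
           avgIssueTimeUs(dBuf, hAlloc, bytes, stream, iters));
    printf("malloc + cudaHostRegister: %.1f us per call\n",
           avgIssueTimeUs(dBuf, hReg, bytes, stream, iters));

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(dBuf));
    CHECK(cudaHostUnregister(hReg));  // unpin before handing the memory back to free()
    free(hReg);
    CHECK(cudaFreeHost(hAlloc));
    return 0;
}
```

If the two numbers come out close, the difference seen in the application is more likely tied to when and where the allocations happen than to the pinning method itself.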

Dear txbob,
My project is running on an Ubuntu server, which does not drive a display.

Best Regards