Syncing Mapped Memory (cudaHostAllocMapped) after cudaMemcpy(Device-Device)

I have been using a combination of mapped memory (cudaHostAllocMapped) and device memory for host-to-device data transfer. With this method, all transfers go through cudaMemcpy with cudaMemcpyDeviceToDevice, and it gives me better performance.

The problem I am facing is illustrated by the following pseudo code:

CUDA_CHECK(cudaMemcpy(<devicemem1>, <mapmem1_deviceptr>, size, cudaMemcpyDeviceToDevice));
CUDA_CHECK(cudaMemcpy(<devicemem2>, <mapmem2_deviceptr>, size, cudaMemcpyDeviceToDevice));

cudaThreadSynchronize(); // CU_CTX_BLOCKING_SYNC (blocking sync)

CUDA_CHECK(cudaMemcpy(<mapmem3_deviceptr>, <devicemem3>, size, cudaMemcpyDeviceToDevice));

PROCESS(<mapmem3_hostptr>); // Processing on CPU with host pointer to mapped memory

NOTE: CUDA_CHECK is just a macro that asserts on the return code.

The PROCESS() call seems to work with STALE data, i.e., the sync to the host address space does not seem to be complete. Is there a way I can force the memory sync to take place?

To solve the problem, I added another cudaThreadSynchronize() before the PROCESS() call, and that seems to work, but I don't need a BLOCKING sync in this case. That brings me to my other question: is there a way to perform a NON-BLOCKING sync when the context is created with CU_CTX_BLOCKING_SYNC?


Shouldn’t you be able to use cuStreamQuery? Or cuEventRecord/cuEventQuery?

Thanks for the suggestion; using streams does solve the second problem. Any idea whether there is a function to explicitly sync mapped memory?
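
For reference, a minimal sketch of the non-blocking approach suggested above, reusing the placeholder names from my pseudo code (mapmem3_deviceptr, devicemem3, mapmem3_hostptr, size, PROCESS, and CUDA_CHECK are from the original; the event name done and do_other_cpu_work() are made up for illustration):

```cuda
// Record an event after the copy, then poll it instead of blocking.
cudaEvent_t done;
CUDA_CHECK(cudaEventCreate(&done));

CUDA_CHECK(cudaMemcpy(mapmem3_deviceptr, devicemem3, size,
                      cudaMemcpyDeviceToDevice));
CUDA_CHECK(cudaEventRecord(done, 0));   // record on the default stream

// cudaEventQuery returns cudaErrorNotReady while the copy is in flight,
// so the CPU stays free to do other work instead of blocking.
while (cudaEventQuery(done) == cudaErrorNotReady) {
    do_other_cpu_work();   // placeholder for useful host-side work
}

PROCESS(mapmem3_hostptr);  // event has signaled; mapped memory is current
CUDA_CHECK(cudaEventDestroy(done));
```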


I don’t quite see how PROCESS(<mapmem3_hostptr>) can possibly work on the correct data without a blocking sync before it. After all, the only reason for blocking is to wait for the correct data to arrive. Can you elaborate?

Actually, I’m a bit surprised by the use of mapmem3_deviceptr and cudaMemcpyDeviceToDevice…

Usually (see C/src/bandwidthTest/ in the SDK) the host address is used instead, matched by a cudaMemcpyDeviceToHost.

In principle they should be the same… but could it be a suboptimal (mis)use of the GPU DMA controller?

Can you elaborate on why you are using this pattern?

This is actually a known bug that I haven’t gotten around to fixing:

cudaMemcpy is blocking except for device-to-device transfers, where there’s no reason to make it blocking in the normal case. However, that’s not true for mapped memory. For the time being, the workaround is either to record an event after the DtoD memcpy to mapped memory and synchronize on that event, or to use a DtoH memcpy (there should be no difference in copy performance between the two).
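
A sketch of the first workaround, under the same placeholder names as the original pseudo code (mapmem3_deviceptr, devicemem3, mapmem3_hostptr, size, PROCESS, and CUDA_CHECK come from that post; the event name done is made up here):

```cuda
// Workaround: record an event right after the DtoD copy into mapped
// memory, then synchronize on that event before touching the host pointer.
cudaEvent_t done;
CUDA_CHECK(cudaEventCreate(&done));

CUDA_CHECK(cudaMemcpy(mapmem3_deviceptr, devicemem3, size,
                      cudaMemcpyDeviceToDevice));
CUDA_CHECK(cudaEventRecord(done, 0));       // record on the default stream
CUDA_CHECK(cudaEventSynchronize(done));     // wait until the copy has landed

PROCESS(mapmem3_hostptr);                   // host now sees fresh data

CUDA_CHECK(cudaEventDestroy(done));
```

The alternative is simply to replace the DtoD copy with cudaMemcpy(<mapmem3_hostptr>, <devicemem3>, size, cudaMemcpyDeviceToHost), which blocks by itself.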

That explains it. Thanks.

Since I found the DtoD memcpy to be faster than the DtoH memcpy, I went ahead and fell into the trap. Using a sync, of course, solves the problem with no difference in performance, exactly as you mentioned.


PS: Sorry about the delayed response. I just enabled email notifications; I had thought they were enabled by default.