I have been using a combination of both Mapped Memory (hostAllocMapped) and Device Memory for host to device data transfer. Using this method, all the data transfer is through memcpy(Device-to-Device) and it gives me better performance.
The problem I am facing is in the following pseudo code,
CUDA_CHECK(cudaMemcpy(<devicemem1>, <mapmem1_deviceptr>, size, cudaMemcpyDeviceToDevice));
CUDA_CHECK(cudaMemcpy(<devicemem2>, <mapmem2_deviceptr>, size, cudaMemcpyDeviceToDevice));
KERNEL_CALL(<devicemem1>,<devicemem2>,<devicemem3>);
cudaThreadSynchronize(); // CU_CTX_BLOCKING_SYNC (blocking sync)
CUDA_CHECK(cudaMemcpy(<mapmem3_deviceptr>, <devicemem3>, size, cudaMemcpyDeviceToDevice));
PROCESS(<mapmem3_hostptr>); // Processing on CPU with host pointer to mapped memory
NOTE: CUDA_CHECK is just a macro which asserts on return code.
The PROCESS() call seems to work with STALE data, ie., the sync to host address space doesn’t seem to be complete. Is there a way I can enforce the syncing of memory to take place ?
To solve the problem, I used another cudaThreadSynchronize() before PROCESS() call and that seems to work, but I don’t need a BLOCKING sync in this case. That brings us to my other question, Is there a way to perform a NON BLOCKING sync when the context is created with CU_CTX_BLOCKING_SYNC ?
I don’t quite see how PROCESS(<mapmem3_hostptr>) can possibly work on the correct data without a blocking sync before it. After all the only reason for blocking is to wait for arrival of the correct data. Can you elaborate?
This is actually a known bug that I haven’t gotten around to fixing:
cudaMemcpy is blocking except for device-to-device transfers, where there’s no reason to make it blocking in the normal case. However, that’s not true for mapped memory. For the time being, the workaround is to either record an event after the DtoD memcpy to mapped memory and synchronize on that event or use a DtoH memcpy (there should be no difference in copy performance between the two).
Since, I found DtoD memcpy to be faster than using the DtoH memcpy, I went ahead and fell into the trap. Using a sync, of course, solves the problem with no difference in performance, exactly like you had mentioned.
0/
PS: Sorry about the delayed response. I just enabled the email notifications. I had thought it was enabled by default.