I have been using a combination of Mapped Memory (cudaHostAllocMapped) and Device Memory for host-to-device data transfer. With this method, every transfer is a memcpy(Device-to-Device), which gives me better performance.
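For context, a minimal sketch of this allocation pattern (all names here are illustrative, matching the placeholders in my pseudo code below, and error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>

int main() {
    // Must be set before any mapped allocations so the device can
    // address pinned host memory directly.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const size_t size = 1 << 20;
    float *mapmem1_hostptr = NULL, *mapmem1_deviceptr = NULL;
    float *devicemem1 = NULL;

    // Pinned host buffer, mapped into the device address space.
    cudaHostAlloc((void **)&mapmem1_hostptr, size, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&mapmem1_deviceptr, mapmem1_hostptr, 0);

    // Ordinary device buffer.
    cudaMalloc((void **)&devicemem1, size);

    // The host-to-device "copy" is now a device-to-device memcpy,
    // since the source pointer is a device view of host memory.
    cudaMemcpy(devicemem1, mapmem1_deviceptr, size, cudaMemcpyDeviceToDevice);

    cudaFree(devicemem1);
    cudaFreeHost(mapmem1_hostptr);
    return 0;
}
```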
The problem I am facing is in the following pseudo code:

CUDA_CHECK(cudaMemcpy(<devicemem1>, <mapmem1_deviceptr>, size, cudaMemcpyDeviceToDevice));
CUDA_CHECK(cudaMemcpy(<devicemem2>, <mapmem2_deviceptr>, size, cudaMemcpyDeviceToDevice));
KERNEL_CALL(<devicemem1>, <devicemem2>, <devicemem3>);
cudaThreadSynchronize(); // context created with CU_CTX_BLOCKING_SYNC (blocking sync)
CUDA_CHECK(cudaMemcpy(<mapmem3_deviceptr>, <devicemem3>, size, cudaMemcpyDeviceToDevice));
PROCESS(<mapmem3_hostptr>); // processing on CPU via the host pointer to mapped memory
NOTE: CUDA_CHECK is just a macro that asserts on the return code.
The PROCESS() call seems to operate on STALE data, i.e., the sync back to host address space does not appear to be complete by the time the CPU reads it. Is there a way I can enforce that the memory is fully synchronized before PROCESS() runs?
To solve the problem, I added another cudaThreadSynchronize() before the PROCESS() call, and that seems to work, but I don't want a BLOCKING sync in this case. That brings me to my other question: is there a way to perform a NON-BLOCKING sync when the context is created with CU_CTX_BLOCKING_SYNC?
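One workaround I am considering (an assumption on my part, not something I have verified against CU_CTX_BLOCKING_SYNC semantics): record an event after the final copy and poll it with cudaEventQuery, which returns immediately with cudaErrorNotReady instead of blocking the calling thread. A sketch, reusing the placeholder names from the pseudo code above:

```cuda
cudaEvent_t done;
cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

// Final device-to-mapped-memory copy, then mark completion with an event.
CUDA_CHECK(cudaMemcpy(mapmem3_deviceptr, devicemem3, size, cudaMemcpyDeviceToDevice));
cudaEventRecord(done, 0);

// Poll instead of blocking: cudaEventQuery returns immediately, so the CPU
// can do other useful work between polls rather than sleeping in the driver.
while (cudaEventQuery(done) == cudaErrorNotReady) {
    /* do other CPU work here */
}

cudaEventDestroy(done);
PROCESS(mapmem3_hostptr); // mapped memory should now be coherent with the device
```

Would this be a valid way to get non-blocking behavior under a blocking-sync context, or does cudaEventQuery inherit the blocking behavior too?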