Syncing Mapped Memory (cudaHostAllocMapped) after cudaMemcpy(Device-Device)

0throot · December 24, 2010, 10:13am

I have been using a combination of both Mapped Memory (hostAllocMapped) and Device Memory for host to device data transfer. Using this method, all the data transfer is through memcpy(Device-to-Device) and it gives me better performance.

The problem I am facing is in the following pseudo code,

CUDA_CHECK(cudaMemcpy(<devicemem1>, <mapmem1_deviceptr>, size, cudaMemcpyDeviceToDevice));
CUDA_CHECK(cudaMemcpy(<devicemem2>, <mapmem2_deviceptr>, size, cudaMemcpyDeviceToDevice));

KERNEL_CALL(<devicemem1>,<devicemem2>,<devicemem3>);
cudaThreadSynchronize(); // CU_CTX_BLOCKING_SYNC (blocking sync)

CUDA_CHECK(cudaMemcpy(<mapmem3_deviceptr>, <devicemem3>, size, cudaMemcpyDeviceToDevice));

PROCESS(<mapmem3_hostptr>); // Processing on CPU with host pointer to mapped memory

NOTE: CUDA_CHECK is just a macro which asserts on return code.

The PROCESS() call seems to work with STALE data, ie., the sync to host address space doesn’t seem to be complete. Is there a way I can enforce the syncing of memory to take place ?

To solve the problem, I used another cudaThreadSynchronize() before PROCESS() call and that seems to work, but I don’t need a BLOCKING sync in this case. That brings us to my other question, Is there a way to perform a NON BLOCKING sync when the context is created with CU_CTX_BLOCKING_SYNC ?

-0/

sergeyn · December 24, 2010, 3:10pm

Shouldn’t you be able to use cuStreamQuery ? or cuEventRecord/cuEventQuery ?

0throot · December 28, 2010, 5:09am

Thanks for the suggestion, using streams does solve the 2nd problem. Any idea whether there is a function to explicitly sync mapped memory ?

-0/

tera · December 28, 2010, 11:14am

I don’t quite see how PROCESS(<mapmem3_hostptr>) can possibly work on the correct data without a blocking sync before it. After all the only reason for blocking is to wait for arrival of the correct data. Can you elaborate?

d.rossetti · December 28, 2010, 2:14pm

CUDA_CHECK(cudaMemcpy(, <mapmem1_deviceptr>, size, cudaMemcpyDeviceToDevice));
CUDA_CHECK(cudaMemcpy(<devicemem2>, <mapmem2_deviceptr>, size, cudaMemcpyDeviceToDevice));
KERNEL_CALL(,,);
cudaThreadSynchronize(); // CU_CTX_BLOCKING_SYNC (blocking sync)
CUDA_CHECK(cudaMemcpy(<mapmem3_deviceptr>, , size, cudaMemcpyDeviceToDevice));

Actually I’m a bit surprised by the use of mapmem3_deviceptr and cudaMemcpyDeviceToDevice…

usually (see C/src/bandwidthTest/bandwidthTest.cu in SDK) the host address is used instead, matched by a cudaMemcpyDeviceToHost.

In principle they should be the same… but could it be a suboptimal (mis)using of the GPU DMA controller ?

can you elaborate on why you are using this pattern?

tmurray · December 28, 2010, 11:38pm

This is actually a known bug that I haven’t gotten around to fixing:

cudaMemcpy is blocking except for device-to-device transfers, where there’s no reason to make it blocking in the normal case. However, that’s not true for mapped memory. For the time being, the workaround is to either record an event after the DtoD memcpy to mapped memory and synchronize on that event or use a DtoH memcpy (there should be no difference in copy performance between the two).

0throot · January 11, 2011, 9:57am

That explains it, Thanks.

Since, I found DtoD memcpy to be faster than using the DtoH memcpy, I went ahead and fell into the trap. Using a sync, of course, solves the problem with no difference in performance, exactly like you had mentioned.

0/

PS: Sorry about the delayed response. I just enabled the email notifications. I had thought it was enabled by default.

Topic		Replies	Views
Are cudaMemCpy and cudaMalloc blocking/synchronous? CUDA Programming and Performance	1	147	September 30, 2024
cudaDeviceSynchronize needed between kernel launch and cudaMemcpy ? CUDA Programming and Performance	15	16082	September 29, 2017
CPU blocked MUCH longer than expected calling a cudaMemcpy after a cuda graph launch CUDA Programming and Performance	7	509	October 19, 2023
cudaThreadSynchronize() stalls application CUDA Programming and Performance	10	10988	November 17, 2009
strange behavior with device emulation CUDA Programming and Performance	5	2693	May 20, 2008
cudaMemcpyAsync clarification required & help needed CUDA Programming and Performance	0	1749	October 17, 2009
cudaMemcpyAsync problem CUDA Programming and Performance	9	3009	May 26, 2020
Questions about when using cudaMemcpyAsync(), the host is blocked CUDA Programming and Performance	6	3488	April 5, 2018
Trying to run cudaMemsetAsync in a more timely manner CUDA Programming and Performance	8	1252	September 15, 2019
cudaHostRegister returns cudaErrorInvalidValue CUDA Programming and Performance	14	2661	January 28, 2021

Syncing Mapped Memory (cudaHostAllocMapped) after cudaMemcpy(Device-Device)

Related topics