When we use cudaMemcpy to copy data from device memory to system RAM, is it a DMA transfer, or is the CPU involved in the copy?


As I understand it, the data really is copied from device memory to host memory (RAM), for the reasons below.

  1. The copy time depends on the size of the data.

  2. You can free the device memory after the data has been copied to host memory.

With DMA, these observations might not hold.

The amount of CPU involvement also depends on whether the host memory block is pinned or not. If the memory is pinned, then the driver can issue a DMA request to the GPU to transfer the data directly to or from host RAM, without fear of the virtual memory system moving the pages during the transfer. If the memory is pageable (as it is unless you allocated it with something like cudaMallocHost()), then the driver first has to copy chunks of your buffer into a private region of pinned memory and start DMA transfers of that private buffer. This is why pinned and pageable memory have different transfer rates (unless you have a speedy X58 Core i7 system).
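The difference is easy to observe by timing the same copy from both kinds of host allocation. Below is a minimal sketch (my own illustration, not from the answer above); the 64 MiB size and the use of CUDA events for timing are arbitrary choices, and error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const size_t N = 64 << 20;               /* 64 MiB */
    char *d_buf, *h_pageable, *h_pinned;

    cudaMalloc(&d_buf, N);
    h_pageable = (char *)malloc(N);          /* ordinary pageable memory */
    cudaMallocHost(&h_pinned, N);            /* page-locked (pinned) memory */

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    /* Pageable: the driver stages chunks through its own pinned buffer */
    cudaEventRecord(start);
    cudaMemcpy(h_pageable, d_buf, N, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pageable D2H: %.2f ms\n", ms);

    /* Pinned: the DMA engine writes straight into the user buffer */
    cudaEventRecord(start);
    cudaMemcpy(h_pinned, d_buf, N, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned   D2H: %.2f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    free(h_pageable);
    cudaFreeHost(h_pinned);
    return 0;
}
```

On most systems the pinned copy should report a noticeably higher effective bandwidth, reflecting the staging overhead described above.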

Do we also have a similar situation when copying data from device to host? Is there any involvement of the GPU when transferring data from device to host?

For unpinned, pageable memory, the device-to-host copy is facilitated by the driver via PIO (programmed I/O).
When the memory is known to be pinned and contiguous, the driver initiates a DMA transfer and frees up CPU cycles.
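One practical consequence of the pinned-memory DMA path is that the copy can run asynchronously with respect to the CPU. A minimal sketch (my own example; cudaMemcpyAsync only overlaps like this when the host buffer is pinned, and error checking is omitted):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const size_t bytes = 1 << 20;            /* 1 MiB, arbitrary */
    char *d_buf, *h_pinned;

    cudaMalloc(&d_buf, bytes);
    cudaMallocHost(&h_pinned, bytes);        /* pinned host buffer */

    cudaStream_t s;
    cudaStreamCreate(&s);

    /* With a pinned destination this is a true DMA: the call returns
       immediately and the CPU is free until we synchronize. With a
       pageable destination the driver would stage the copy instead. */
    cudaMemcpyAsync(h_pinned, d_buf, bytes, cudaMemcpyDeviceToHost, s);

    /* ...the CPU can do other work here while the DMA engine runs... */

    cudaStreamSynchronize(s);                /* wait for the transfer */
    printf("transfer complete\n");

    cudaStreamDestroy(s);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```

This overlap of transfer and CPU work is exactly what the PIO path for pageable memory cannot give you.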