Is cudaMemcpyAsync inside a kernel controlled by the GPU?

Hey experts!

I have this code snippet which copies data from the CPU to the GPU from within a kernel.

__global__ void kernel( int* host_data, int* device_data, size_t size )
{
    cudaMemcpyAsync( host_data, device_data, size * sizeof( int ), cudaMemcpyDefault );
    cudaDeviceSynchronize();
}

I was wondering whether the GPU instantiates the transfer? I.e The GPU tells its DMA engines to transfer the data to the GPU without telling the CPU, or does the GPU just tell the CPU that a cudaMemcpyAsync call was invoked and then the copy works as per usual?

Thanks!

You can’t copy data from the CPU to the GPU from within a kernel using cudaMemcpyAsync.

This is covered in the programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#api-reference

Notes about all memcpy/memset functions:
Only async memcpy/set functions are supported
Only device-to-device memcpy is permitted

I have it working though with both paged and pinned host memory. How is this possible? Maybe they haven’t updated their documentaton?

It is workable with pinned memory. If by paged memory you mean memory allocated under UM, then that is also supported.

In the pinned case, it is handled identically to any other GPU access to pinned memory. In the paged case it is handled identically to any other GPU access to UM.

Oh okay - so it does just message the CPU to say perform a cudaMemcpyAsync?

no, that isn’t how it works in either case.

With pinned memory, host memory is actually mapped into device memory space. When the device accesses such memory, the device generates PCIE (or NVLink) transactions that get directly deposited into host memory (or else read from host memory and return it to the GPU global request). There is no programming of any DMA controllers in such a scenario.

With UM, it uses a demand-paging system, which is not the same as having the CPU program a DMA controller for an ordinary cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost type of transfer:

https://devblogs.nvidia.com/parallelforall/cuda-8-features-revealed/

Ohh interesting… so a cudaMemcpyAsync or cudaMemcpy from the GPU is actually generating PCIe/NVLink transactions the same way zero copy does?

Hi, I know it’s been a while but I got a similar question. I tried to call cudaMemcpyAsync on pinned memory but it won’t even compile saying that I can’t use a host function inside a global function. I tried this on my old laptop , the details of which are given below. I checked the 8.0 cuda docs and the function is declared with host device @ https://docs.nvidia.com/cuda/archive/8.0/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79.

I understand this is still not a mainstream or recommended method. I just want to know why this is happening.

gpu:
geforce 820M 2GB (cc 2.1)

nvcc version:
nvcc: NVIDIA ® Cuda compiler driver
Copyright © 2005-2016 NVIDIA Corporation
Built on Sat_Sep__3_19:05:48_CDT_2016
Cuda compilation tools, release 8.0, V8.0.44

Using functions like this in the device generally requires compiling for a compute capability of 3.5 or higher.

You should be able to use memcpy() in a similar fashion while compiling for your cc2.1 device. There should be no difference between memcpy() behavior and in-kernel cudaMemcpyAsync() behavior.

Thank you for the response. I tried both copying using memcpy and plain old for loop and they give same results with -O3 flag at least. The only difference is that the memcpy version is significantly slower for the first 2-5 kernel launches then manages to catch up with the for loop version by speding up slightly morein the long run. This can be a laptop thing since I’ve noticed running a CUDA program after a while takes more time possibly because the OS has to wake up the gpu while it was using only the integrated gpu. But this initial slowdown occurs independent of the time I run the program. So I’m guessing memcpy uses some functions that requires some time to wake up or get cached on initial call.