Is cudaMemcpyAsync inside a kernel controlled by the GPU?

Hey experts!

I have this code snippet which copies data from the CPU to the GPU from within a kernel.

__global__ void kernel( int* host_data, int* device_data, size_t size )
{
    cudaMemcpyAsync( device_data, host_data, size * sizeof( int ), cudaMemcpyDefault );  // args are ( dst, src, count, kind )
}

I was wondering whether the GPU initiates the transfer itself, i.e. does the GPU tell its DMA engines to transfer the data to the GPU without involving the CPU? Or does the GPU just notify the CPU that cudaMemcpyAsync was invoked, after which the copy proceeds as usual?