Does cudaMemcpyAsync(), when called from a GPU kernel function, use the DMA engine or the GPU cores?

The following presentation contains this code:
http://on-demand.gputechconf.com/gtc/2012/presentations/S0338-GTC2012-CUDA-Programming-Model.pdf

// Child kernel used by the example; its definition is not shown in the slides.
__global__ void launch(float *buf);

__device__ float buf[1024];

__global__ void dynamic(float *data)
{
    int tid = threadIdx.x;

    // Odd threads combine pairs of input elements into the device buffer.
    if (tid % 2)
        buf[tid / 2] = data[tid] + data[tid + 1];
    __syncthreads();

    // Thread 0 launches a child grid (dynamic parallelism) and waits for it.
    if (tid == 0) {
        launch<<< 128, 256 >>>(buf);
        cudaDeviceSynchronize();
    }
    __syncthreads();

    // Device-side copy back into the caller's array; the slide abbreviates this
    // call, the device runtime also expects a byte count and a copy kind.
    cudaMemcpyAsync(data, buf, 1024 * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaDeviceSynchronize();
}
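
For reference, dynamic parallelism needs a device of compute capability 3.5 or higher and separate compilation. A minimal host-side driver for this kernel might look like the sketch below; the launch configuration and the way `data` is filled are my own assumptions, not part of the presentation.

// Minimal host-side driver (an illustrative assumption, not from the slides).
// Build with relocatable device code, e.g.:
//   nvcc -arch=sm_35 -rdc=true example.cu -lcudadevrt
#include <cuda_runtime.h>

int main()
{
    float *data;
    cudaMalloc(&data, 1024 * sizeof(float));
    // ... initialise data from the host, e.g. with cudaMemcpy ...
    dynamic<<< 1, 1024 >>>(data);   // one block of 1024 threads, matching buf[1024]
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}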

In this case, when cudaMemcpyAsync() is called from the GPU kernel function, does it use the DMA engine, or does it use the GPU cores via dynamic parallelism (i.e., it launches another kernel that performs the copy on the GPU cores)?
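
To make the distinction concrete, a copy performed "with the GPU cores" would be an ordinary copy kernel like the hypothetical sketch below (the kernel name and launch configuration are my own, not from the presentation); the question is whether the device-side cudaMemcpyAsync() works this way or is handed off to a DMA/copy engine.

// Hypothetical element-wise copy kernel: this is what a copy "using the GPU cores"
// looks like, as opposed to a transfer serviced by a DMA/copy engine.
__global__ void copyKernel(float *dst, const float *src, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}

// Inside dynamic(), the cudaMemcpyAsync() call could in principle be replaced by:
//   copyKernel<<< (1024 + 255) / 256, 256 >>>(data, buf, 1024);
//   cudaDeviceSynchronize();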