cudaMemcpy execution time

I have run the visual profiler on a parallel reduction code I wrote recently, and this is the result:

[profiler screenshot attachment]

I am not very experienced in interpreting CUDA profiler results, but is it normal for cudaMemcpy to take up more than 90% of the execution time? The array I am trying to reduce has 1048576 elements. Is there any way to speed up the process, or did I use cudaMemcpy the wrong way? The following is my reduction code:

[codebox]#include "mex.h"
#include "math.h"
#include "cuda.h"

#define BLOCK_SIZE 512

__global__ void reduction0(float *x, float *psum, int n)
{
    __shared__ float sdata[BLOCK_SIZE];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*2*blockDim.x + threadIdx.x;
    unsigned int gridSize = BLOCK_SIZE*2*gridDim.x;

    // grid-stride loop: each thread accumulates two elements per iteration
    sdata[tid] = 0;
    while (i < n) {
        sdata[tid] += x[i] + x[i+blockDim.x];
        i += gridSize;
    }
    __syncthreads();

    // tree reduction in shared memory down to the last warp
    for (unsigned int stride = blockDim.x>>1; stride > 32; stride >>= 1)
    {
        if (tid < stride)
        {
            sdata[tid] += sdata[tid+stride];
        }
        __syncthreads();
    }

    // unrolled final warp (relies on warp-synchronous execution)
    if (tid < 32)
    {
        sdata[tid] += sdata[tid+32];
        sdata[tid] += sdata[tid+16];
        sdata[tid] += sdata[tid+8];
        sdata[tid] += sdata[tid+4];
        sdata[tid] += sdata[tid+2];
        sdata[tid] += sdata[tid+1];
    }

    // one partial sum per block
    if (tid == 0) psum[blockIdx.x] = sdata[0];
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    float *x, *dx;
    float *sum, *dpsum;
    int length, blockno;
    int dim[2];

    x = (float*)mxGetPr(prhs[0]);
    length = mxGetN(prhs[0]);

    // output: 64 partial sums, one per block
    dim[0] = 1;
    dim[1] = 64;
    plhs[0] = mxCreateNumericArray(2, dim, mxSINGLE_CLASS, mxREAL);
    sum = (float*)mxGetData(plhs[0]);

    cudaMalloc((void**)&dx, length*sizeof(float));
    cudaMalloc((void**)&dpsum, 64*sizeof(float));

    cudaMemcpy(dx, x, length*sizeof(float), cudaMemcpyHostToDevice);

    reduction0<<<64,BLOCK_SIZE>>>(dx, dpsum, length);

    cudaMemcpy(sum, dpsum, 64*sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dx);
    cudaFree(dpsum);
    return;
}[/codebox]

reduction0<<<64,BLOCK_SIZE>>>(dx,dpsum,length);
cudaMemcpy(sum,dpsum,64*sizeof(float),cudaMemcpyDeviceToHost);

You don't have a cudaThreadSynchronize() after the kernel launch, and therefore the memcpy will wait for the kernel's completion. That is probably why you see 90% of the time attributed to the memcpy, while it is probably actually the kernel itself.

eyal
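For what it's worth, here is a minimal timing sketch (assuming the dx, dpsum, sum, length and reduction0 from the code above) that uses CUDA events so the kernel time cannot be folded into the device-to-host copy:

[codebox]// Minimal timing sketch - reuses dx, dpsum, sum, length and reduction0 from above.
cudaEvent_t start, afterKernel, afterCopy;
cudaEventCreate(&start);
cudaEventCreate(&afterKernel);
cudaEventCreate(&afterCopy);

cudaEventRecord(start, 0);
reduction0<<<64,BLOCK_SIZE>>>(dx, dpsum, length);
cudaEventRecord(afterKernel, 0);                // marks the end of the kernel
cudaMemcpy(sum, dpsum, 64*sizeof(float), cudaMemcpyDeviceToHost);
cudaEventRecord(afterCopy, 0);                  // marks the end of the D2H copy
cudaEventSynchronize(afterCopy);                // wait until everything has finished

float kernelMs = 0.0f, copyMs = 0.0f;
cudaEventElapsedTime(&kernelMs, start, afterKernel);
cudaEventElapsedTime(&copyMs, afterKernel, afterCopy);
mexPrintf("kernel: %f ms, D2H copy: %f ms\n", kernelMs, copyMs);

cudaEventDestroy(start);
cudaEventDestroy(afterKernel);
cudaEventDestroy(afterCopy);[/codebox]

cudaEventSynchronize() plays the same role here as the missing cudaThreadSynchronize(): it forces the host to wait until the recorded work has actually finished before the timings are read.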

For a pure reduction this is to be expected; there is just not enough work in it to make up for the PCIe transfer time.

Reduction is purely memory bound, and most CUDA devices have a memory bandwidth that is well over 10x the PCIe bandwidth. So yes, it is to be expected that this example spends over 90% of its time in cudaMemcpy.
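As a rough back-of-the-envelope check (the exact numbers depend on your system; assume ~5 GB/s effective PCIe bandwidth and ~100 GB/s device memory bandwidth): copying 1048576 floats (4 MB) to the device takes about 4 MB / 5 GB/s ≈ 0.8 ms, while the kernel only has to read those 4 MB once from device memory, roughly 4 MB / 100 GB/s ≈ 0.04 ms. On those assumptions the host-to-device copy alone already accounts for about 95% of the total time.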

Thanks for the explanations. Does that mean that a serial version of a simple reduction (like summation) is always faster than a GPU one? In what situations is a reduction algorithm worth implementing on the GPU?

The reduction is worth doing on the GPU if the data is already on the GPU, e.g. because it has been generated there.
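For illustration, a hedged sketch of that situation (reusing dx, dpsum, sum, length, BLOCK_SIZE and reduction0 from the code above; generate() is a made-up kernel standing in for whatever actually produces the data on the device):

[codebox]// Hypothetical kernel standing in for whatever really generates the data on the device.
__global__ void generate(float *x, int n)
{
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) x[i] = 1.0f;   // placeholder computation
}

// ... inside the host code, after the cudaMalloc calls:
generate<<<(length+BLOCK_SIZE-1)/BLOCK_SIZE, BLOCK_SIZE>>>(dx, length); // data created on the GPU
reduction0<<<64,BLOCK_SIZE>>>(dx, dpsum, length);                       // and reduced on the GPU
cudaMemcpy(sum, dpsum, 64*sizeof(float), cudaMemcpyDeviceToHost);       // only 64 floats cross PCIe[/codebox]

With the 4 MB host-to-device copy gone, the only PCIe traffic is 256 bytes of partial sums, and the kernel time dominates instead of the transfer.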