Memory Transfer

I just started with CUDA programming, and my first kernels are working.

Yesterday I ran some performance tests. Please look at the timings:

    MemCopy -> Dev. time:  143.943 (ms)
    Cuda Processing time:    0.058 (ms)
    MemCopy Back    time: 7486.972 (ms)

Why is the MemCopy Back so slow?
I am using CUDA 2.0 on a 9600 GT.

Thank you for any help.


Partic_Core.txt (3.42 KB)

Since you do not explicitly synchronize, what you actually measured is:

    MemCopy -> Dev. time:               143.943 (ms)
    Cuda kernel launch time:              0.058 (ms)
    Processing and MemCopy Back time:  7486.972 (ms)

See my first attachment for the whole code.

I don't understand why the transfer from the GPU to the CPU is about
52x slower than the transfer from the CPU to the GPU for the same number of objects:


MemCopy -> Dev. time: 143.943 (ms)
Cuda Processing time: 0.058 (ms)
MemCopy Back time: 7486.972 (ms)

__global__ void cu_calcEnergie( Energie *En, int np, float Ka )
int j, x, y, n, n_gotEn;
x= blockIdx.x
y= blockIdx.y*BLOCK_SIZE+threadIdx.y;
n= y*np+ x;


void calcPartic( Energie *En, int np, float Ka, int Ro )
{
    cudaError_t result;
    Energie *En_d;

    fprintf(stdout,"\tCuda2: np %d\n",np); fflush(stdout);
    unsigned int timer=0;
    cutStartTimer(timer);

    result= cudaMalloc( (void**)&En_d, sizeof(Energie)*np );
    if (result != cudaSuccess) { printf("cudaMalloc failed - En_d \n"); exit(1); }

    result= cudaMemcpy( En_d, En, sizeof(Energie)*np, cudaMemcpyHostToDevice);
    if (result != cudaSuccess) { printf("cudaMemcpy - Host-> GPU failed - En_d \n"); exit(1); }

    printf(" MemCopy -> Dev. time: %8.3f (ms)\n",cutGetTimerValue(timer));

    dim3 dimblock( BLOCK_SIZE,BLOCK_SIZE, 1);     //  <512
    dim3 dimgrid ( np/(BLOCK_SIZE*BLOCK_SIZE)+1); //  (!) 1 , <65535

    cu_calcEnergie<<<dimgrid,dimblock>>>( En_d, np, Ka );

    printf(" Cuda Processing time: %8.3f (ms)\n",cutGetTimerValue(timer));

    result= cudaMemcpy( En, En_d, sizeof(Energie)*np, cudaMemcpyDeviceToHost);
    if (result != cudaSuccess) {
        printf(" \n *** cudaMemcpy GPU -> Host failed !\n");
    }
    printf(" MemCopy Back    time: %8.3f (ms)\n\n",cutGetTimerValue(timer));
}


Those calls are asynchronous, so you don't measure anything meaningful unless you make sure the memcpy has actually finished, as Reimar pointed out.



I guess you mean "the calculation" instead of "the memcpy". The memcpys are necessarily finished when the function call returns. The kernel call, though, only starts execution on the GPU, and the following memcpy must wait for the kernel to complete before it can even start copying.

Indeed, thanks for the clarification!

I should not post that early in the morning :)

To be a bit more explicit than previous posters:

Because of the asynchronous calls to the GPU, you must precede any wall clock timing measurement with a call to cudaThreadSynchronize(). If you want more details, search the forums for cudaThreadSynchronize(); there have only been a few hundred threads on this subject…
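
As a rough illustration of that advice, here is a minimal sketch of how the three phases could be timed separately. It is not the code from the attachment: the kernel `scale`, the helpers `check` and `ms_since`, and the array size are made up for the example, and it uses std::chrono instead of the cutil timer. It also assumes a newer toolkit, where cudaDeviceSynchronize() replaces the deprecated cudaThreadSynchronize(); on CUDA 2.0 you would call cudaThreadSynchronize() in the same place.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel in the question.
__global__ void scale(float *data, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= k;
}

// Hypothetical helper: milliseconds elapsed since t0 on the host clock.
static double ms_since(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
}

// Hypothetical helper: abort with a message if a CUDA call failed.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(1);
    }
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    if (h == NULL) { fprintf(stderr, "malloc failed\n"); return 1; }
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d = NULL;
    check(cudaMalloc((void **)&d, bytes), "cudaMalloc");

    // Phase 1: host -> device. This cudaMemcpy blocks until the copy is done.
    auto t0 = std::chrono::steady_clock::now();
    check(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice), "copy to device");
    printf(" MemCopy -> Dev. time: %8.3f (ms)\n", ms_since(t0));

    // Phase 2: kernel. The launch returns immediately, so wait for the GPU
    // to finish before reading the clock.
    t0 = std::chrono::steady_clock::now();
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "kernel execution");
    printf(" Cuda Processing time: %8.3f (ms)\n", ms_since(t0));

    // Phase 3: device -> host. The kernel is already done, so this
    // measures only the copy.
    t0 = std::chrono::steady_clock::now();
    check(cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost), "copy to host");
    printf(" MemCopy Back    time: %8.3f (ms)\n", ms_since(t0));

    cudaFree(d);
    free(h);
    return 0;
}
```

With the synchronize call in place, the kernel's run time shows up in the "Cuda Processing" line instead of being added to the copy back.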

Thanks for all your advice. Now I see my mistakes.