Memory Transfer

hopi2 · October 9, 2008, 5:37am

I just started with CUDA programming, and my first kernels are working.

Yesterday I ran some performance tests. Please look at the timings:

    MemCopy -> Dev. time:  143.943 (ms)
    Cuda Processing time:    0.058 (ms)
    MemCopy Back    time: 7486.972 (ms)

Why is the MemCopy Back so slow?
I am using CUDA 2.0 on a 9600 GT.

Thank you for any help.

   horst

Partic_Core.txt (3.42 KB)

Reimar · October 9, 2008, 6:20am

Since you do not explicitly synchronize, what you measured is:

    MemCopy -> Dev. time:               143.943 (ms)
    Cuda kernel launch time:              0.058 (ms)
    Processing and MemCopy Back time:  7486.972 (ms)

hopi2 · October 9, 2008, 6:57am

See my first attachment for the whole code.

I don't understand why the transfer from the GPU to the CPU, for the
same number of objects, is about 52x slower than from the CPU to the GPU:

horst.

MemCopy → Dev. time: 143.943 (ms)
Cuda Processing time: 0.058 (ms)
MemCopy Back time: 7486.972 (ms)

__global__ void cu_calcEnergie( Energie *En, int np, float Ka )
{
    int j, x, y, n, n_gotEn;
    x= blockIdx.x*BLOCK_SIZE+threadIdx.x;
    y= blockIdx.y*BLOCK_SIZE+threadIdx.y;
    n= y*np+ x;
    …
}

void calcPartic( Energie *En, int np, float Ka, int Ro )
{
    cudaError_t result;
    Energie *En_d;

    fprintf(stdout,"\tCuda2: np %d\n",np); fflush(stdout);
    unsigned int timer=0;
    cutCreateTimer(&timer);
    cutStartTimer(timer);

    result= cudaMalloc( (void**)&En_d, sizeof(Energie)*np );
    if (result != cudaSuccess) { printf("cudaMalloc failed - En_d \n"); exit(1); }

    result= cudaMemcpy( En_d, En, sizeof(Energie)*np, cudaMemcpyHostToDevice);
    if (result != cudaSuccess) { printf("cudaMemcpy - Host-> GPU failed - En_d \n"); exit(1); }

    cutStopTimer(timer);
    printf(" MemCopy -> Dev. time: %8.3f (ms)\n",cutGetTimerValue(timer));
    cutResetTimer(timer);
    cutStartTimer(timer);

    dim3 dimblock( BLOCK_SIZE,BLOCK_SIZE, 1); //  <512
    dim3 dimgrid ( np/(BLOCK_SIZE*BLOCK_SIZE)+1); //  (!) 1 , <65535

    cu_calcEnergie<<<dimgrid,dimblock>>>( En_d, np, Ka );

    cutStopTimer(timer);
    printf(" Cuda Processing time: %8.3f (ms)\n",cutGetTimerValue(timer));
    cutResetTimer(timer);
    cutStartTimer(timer);

    result= cudaMemcpy( En, En_d, sizeof(Energie)*np, cudaMemcpyDeviceToHost);
    if (result != cudaSuccess) {
        printf(" \n *** cudaMemcpy GPU -> Host failed !\n");
        exit(1);
    }

    cutStopTimer(timer);
    printf(" MemCopy Back    time: %8.3f (ms)\n\n",cutGetTimerValue(timer));
    cudaFree(En_d);
}

gonnet · October 9, 2008, 7:04am

Those calls are asynchronous, so you don't measure anything meaningful unless you make sure the memcpy has actually finished, as Reimar pointed out.

++

Cédric

Reimar · October 9, 2008, 7:07am

I guess you mean “the calculation” instead of “the memcpy”; the memcpys are necessarily finished when the function call returns. The kernel call, though, only starts execution on the GPU, and the following memcpy must wait for it to complete before it can even start copying.
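
A minimal sketch of that behaviour, not taken from the attached code: the kernel `spinKernel` and the helpers `msSince` and `launchThenCopy` are hypothetical names, and `std::chrono` stands in for the CUTIL timer used above. It assumes a device that supports `clock64()` (compute capability 2.0 or later). The launch should return almost immediately, while the blocking `cudaMemcpy` that follows absorbs the kernel's run time.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: busy-waits for roughly `cycles` GPU clock ticks,
// then sets a flag so the host can see that it ran.
__global__ void spinKernel(int *flag, long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
    *flag = 1;
}

// Wall-clock milliseconds elapsed since t0 (host side).
static double msSince(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
}

// Launches the kernel and copies the flag back WITHOUT an explicit sync.
// Returns the flag value read from the device, or -1 if allocation fails.
int launchThenCopy(long long cycles, double *launchMs, double *copyMs)
{
    int *flag_d = nullptr;
    int flag_h = 0;
    if (cudaMalloc((void **)&flag_d, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(flag_d, 0, sizeof(int));

    auto t0 = std::chrono::steady_clock::now();
    spinKernel<<<1, 1>>>(flag_d, cycles);   // returns once the launch is queued
    *launchMs = msSince(t0);                // so this is launch overhead only

    t0 = std::chrono::steady_clock::now();
    // Blocking copy: it cannot start until the kernel has finished,
    // so the kernel's run time is counted here.
    cudaMemcpy(&flag_h, flag_d, sizeof(int), cudaMemcpyDeviceToHost);
    *copyMs = msSince(t0);

    cudaFree(flag_d);
    return flag_h;
}

int main()
{
    double launchMs = 0.0, copyMs = 0.0;
    int flag = launchThenCopy(100000000LL, &launchMs, &copyMs);
    printf("kernel launch:      %8.3f (ms)\n", launchMs);
    printf("kernel + copy back: %8.3f (ms)\n", copyMs);
    printf("flag = %d\n", flag);
    return 0;
}
```

The copy moves only four bytes, so nearly all of the second figure is the kernel itself, which matches the pattern in the timings posted above.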

gonnet · October 9, 2008, 9:54am

Indeed, thanks for the clarification !

I should not post that early in the morning :)

MisterAnderson42 · October 9, 2008, 1:47pm

To be a bit more explicit than previous posters:

Because of the asynchronous calls to the GPU, you must precede any wall-clock timing measurement with a call to cudaThreadSynchronize(). If you want more details, search the forums for cudaThreadSynchronize(); there have only been a few hundred threads on this subject…
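
A hedged sketch of that pattern, assuming a current toolkit: `cudaDeviceSynchronize()` is used because `cudaThreadSynchronize()` has since been deprecated in its favour, and `std::chrono` replaces the CUTIL timer. The kernel `scaleKernel` and the helpers `check`, `msSince`, and `timedRoundTrip` are made up for illustration and are not the code from the attachment.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: multiplies each element by `factor`.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(1);
    }
}

// Wall-clock milliseconds elapsed since t0 (host side).
static double msSince(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
}

// Copies h to the device, scales it, copies it back, timing each phase.
void timedRoundTrip(float *h, int n, float factor)
{
    size_t bytes = sizeof(float) * n;
    float *d = nullptr;
    check(cudaMalloc((void **)&d, bytes), "cudaMalloc");

    auto t0 = std::chrono::steady_clock::now();
    check(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice), "copy to device");
    printf(" MemCopy -> Dev. time: %8.3f (ms)\n", msSince(t0));

    t0 = std::chrono::steady_clock::now();
    scaleKernel<<<(n + 255) / 256, 256>>>(d, n, factor);
    check(cudaGetLastError(), "kernel launch");
    // Wait for the kernel to finish BEFORE reading the clock.
    check(cudaDeviceSynchronize(), "kernel execution");
    printf(" Cuda Processing time: %8.3f (ms)\n", msSince(t0));

    t0 = std::chrono::steady_clock::now();
    check(cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost), "copy to host");
    printf(" MemCopy Back    time: %8.3f (ms)\n", msSince(t0));

    check(cudaFree(d), "cudaFree");
}

int main()
{
    const int n = 1 << 20;
    float *h = (float *)malloc(sizeof(float) * n);
    if (h == NULL) return 1;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    timedRoundTrip(h, n, 2.0f);
    free(h);
    return 0;
}
```

With the synchronize call in place, the kernel's run time is charged to the processing phase, and the copy-back figure reflects the transfer alone.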

hopi2 · October 10, 2008, 6:15am

Thanks for all your advice - now I see my mistakes.

horst.
