hopi2
October 9, 2008, 5:37am
#1
I just started with CUDA programming and my first kernels are working.
Yesterday I ran some performance tests; please have a look at the timings:
MemCopy -> Dev. time: 143.943 (ms)
Cuda Processing time: 0.058 (ms)
MemCopy Back time: 7486.972 (ms)
Why is the MemCopy Back so slow?
I’m using CUDA 2.0 on a 9600 GT.
Thank you for any help.
horst
Partic_Core.txt (3.42 KB)
Reimar
October 9, 2008, 6:20am
#2
MemCopy -> Dev. time: 143.943 (ms)
Cuda Processing time: 0.058 (ms)
MemCopy Back time: 7486.972 (ms)
Since you do not explicitly synchronize, what you measured is:
MemCopy -> Dev. time: 143.943 (ms)
Cuda kernel launch time: 0.058 (ms)
Processing and MemCopy Back time: 7486.972 (ms)
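Schematically (a sketch using the names from your attached code, not your exact listing), the time ends up here:
cu_calcEnergie<<<dimgrid,dimblock>>>( En_d, np, Ka );   // the launch only queues the kernel and returns immediately
cutStopTimer(timer);                                     // ~0.06 ms: just the launch overhead
cutResetTimer(timer);
cutStartTimer(timer);
cudaMemcpy( En, En_d, sizeof(Energie)*np, cudaMemcpyDeviceToHost );  // blocks until the kernel has finished AND the copy is done
cutStopTimer(timer);                                     // ~7487 ms: kernel execution plus the actual copy back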
hopi2
October 9, 2008, 6:57am
#3
See my first attachment for the whole code.
I don’t understand why the transfer from the GPU to the CPU for the
same number of objects is about 52x slower than from CPU to GPU:
horst.
MemCopy -> Dev. time: 143.943 (ms)
Cuda Processing time: 0.058 (ms)
MemCopy Back time: 7486.972 (ms)
__global__ void cu_calcEnergie( Energie *En, int np, float Ka )
{
int j, x, y, n, n_gotEn;
x= blockIdx.x*BLOCK_SIZE+threadIdx.x;
y= blockIdx.y*BLOCK_SIZE+threadIdx.y;
n= y*np+ x;
…
}
void calcPartic( Energie *En, int np, float Ka, int Ro )
{
cudaError_t result;
Energie *En_d;
fprintf(stdout,"\tCuda2: np %d\n",np); fflush(stdout);
unsigned int timer=0;
cutCreateTimer(&timer);
cutStartTimer (timer);
result= cudaMalloc( (void**)&En_d, sizeof(Energie)*np );
if (result != cudaSuccess) { printf("cudaMalloc failed - En_d \n"); exit(1); }
result= cudaMemcpy( En_d, En, sizeof(Energie)*np, cudaMemcpyHostToDevice);
if (result != cudaSuccess) { printf("cudaMemcpy - Host-> GPU failed - En_d \n"); exit(1); }
cutStopTimer(timer);
printf(" MemCopy -> Dev. time: %8.3f (ms)\n",cutGetTimerValue(timer));
cutResetTimer(timer);
cutStartTimer(timer);
dim3 dimblock( BLOCK_SIZE,BLOCK_SIZE, 1); // <512
dim3 dimgrid ( np/(BLOCK_SIZE*BLOCK_SIZE)+1); // (!) 1 , <65535
cu_calcEnergie<<<dimgrid,dimblock>>>( En_d, np, Ka );
cutStopTimer(timer);
printf(" Cuda Processing time: %8.3f (ms)\n",cutGetTimerValue(timer));
cutResetTimer(timer);
cutStartTimer(timer);
result= cudaMemcpy( En, En_d, sizeof(Energie)*np, cudaMemcpyDeviceToHost);
if (result != cudaSuccess) {
printf(" \n *** cudaMemcpy GPU -> Host failed !\n");
exit(1);
}
cutStopTimer(timer);
printf(" MemCopy Back time: %8.3f (ms)\n\n",cutGetTimerValue(timer));
cudaFree(En_d);
}
gonnet
October 9, 2008, 7:04am
#4
see my first attachment for whole Code.
I don’t understand, why the Transfer from the GPU to the CPU for the
same amount of objects is about 52x slower than CPU to Gpu:
horst.
MemCopy -> Dev. time: 143.943 (ms)
Cuda Processing time: 0.058 (ms)
MemCopy Back time: 7486.972 (ms)
[…]
result= cudaMemcpy( En, En_d, sizeof(Energie)*np, cudaMemcpyDeviceToHost);
if (result != cudaSuccess) {
printf(" \n *** cudaMemcpy GPU -> Host failed !\n");
exit(1);
}
cutStopTimer(timer);
Those calls are asynchronous, so you don’t measure anything meaningful unless you make sure the memcpy is actually finished, as Reimar pointed out.
++
Cédric
Reimar
October 9, 2008, 7:07am
#5
Those calls are asynchronous, so you don’t measure anything meaningful unless you make sure the memcpy is actually finished, as Reimar pointed out.
I guess you mean “the calculation” instead of “the memcpy”; the memcpys are necessarily finished when the function call returns. The kernel call, though, only starts execution on the GPU, and the following memcpy must wait for it to complete before it can even start copying.
gonnet
October 9, 2008, 9:54am
#6
I guess you mean “the calculation” instead of “the memcpy”; the memcpys are necessarily finished when the function call returns. The kernel call, though, only starts execution on the GPU, and the following memcpy must wait for it to complete before it can even start copying.
Indeed, thanks for the clarification!
I should not post that early in the morning :)
To be a bit more explicit than the previous posters:
Because of the asynchronous calls to the GPU, you must precede any wall-clock timing measurement with a call to cudaThreadSynchronize(). If you want more details, search the forums for cudaThreadSynchronize(); there have only been a few hundred threads on this subject…
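For example, here is a sketch of the timed part of your calcPartic() with the synchronization added (untested, just to show where the call goes):
cutResetTimer(timer);
cutStartTimer(timer);
cu_calcEnergie<<<dimgrid,dimblock>>>( En_d, np, Ka );
cudaThreadSynchronize();                       // wait until the kernel has really finished
cutStopTimer(timer);
printf(" Cuda Processing time: %8.3f (ms)\n",cutGetTimerValue(timer));
cutResetTimer(timer);
cutStartTimer(timer);
result= cudaMemcpy( En, En_d, sizeof(Energie)*np, cudaMemcpyDeviceToHost);
cutStopTimer(timer);                           // cudaMemcpy blocks, so this now measures only the copy
printf(" MemCopy Back time: %8.3f (ms)\n\n",cutGetTimerValue(timer));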
hopi2
October 10, 2008, 6:15am
#8
Thanks for all your advice - now I see my mistakes.
horst.