Hello
I’ve implemented a CUDA program.
Now I’m measuring the time taken by the kernel functions and the time taken to transfer data (from GPU to CPU).
To measure the data transfer time, I wrote code like the following.
cutCreateTimer( &timer2);
cutStartTimer( timer2);
cudaMemcpy(transformToWorld, d_transformToWorld, sizeof(cuSpatialTransform)*totHandles, cudaMemcpyDeviceToHost);
cutStopTimer(timer2);
printf("Data Transmission Time: %f\n\n", cutGetTimerValue(timer2));
Here, ‘cuSpatialTransform’ is a structure with 12 float members.
In Case 1, totHandles is 256.
When I run the program with Case 1, the data transfer time (the value of timer2) is 1.6~1.7 ms on average.
In Case 2, totHandles is 229, which is smaller than in Case 1.
But the average data transfer time for Case 2 is 64~65 ms!!
I don’t understand it.
How is this possible?
Try putting cudaThreadSynchronize() right before cutStartTimer(timer2).
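Something along these lines, as a rough sketch rather than your actual code. The kernel fillTransforms, the helper timeCopyMs and the struct layout are made up for illustration. I have also swapped the cutil timer for std::chrono and used cudaDeviceSynchronize(), which replaces the deprecated cudaThreadSynchronize() in newer CUDA runtimes.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's struct: 12 floats.
struct cuSpatialTransform { float m[12]; };

// Hypothetical kernel standing in for whatever ran before the copy.
__global__ void fillTransforms(cuSpatialTransform* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        for (int k = 0; k < 12; ++k)
            out[i].m[k] = static_cast<float>(i);
    }
}

// Returns the device-to-host copy time in milliseconds, or -1.0 on a CUDA error.
double timeCopyMs(int totHandles)
{
    size_t bytes = sizeof(cuSpatialTransform) * totHandles;
    cuSpatialTransform* h = static_cast<cuSpatialTransform*>(malloc(bytes));
    cuSpatialTransform* d = nullptr;
    if (!h || cudaMalloc((void**)&d, bytes) != cudaSuccess) {
        free(h);
        return -1.0;
    }

    // The launch returns to the host immediately; the kernel may still be running.
    fillTransforms<<<(totHandles + 127) / 128, 128>>>(d, totHandles);

    // Wait for the kernel here so its run time is not charged to the copy.
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    auto t1 = std::chrono::steady_clock::now();

    cudaFree(d);
    free(h);
    if (err != cudaSuccess)
        return -1.0;
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling timeCopyMs(229) and timeCopyMs(256) from main and printing the results should give copy times that no longer include the kernel. If you comment out the cudaDeviceSynchronize() line, the kernel time gets folded back into the measurement.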
Wow,
now the data transfer time of Case 2 is 0.02~0.03 ms.
What does cudaThreadSynchronize do to the timer?
Anyway thank you very much!!! :D
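It ensures that all previous asynchronous operations on the GPU have completed. Kernel launches return to the host immediately, and cudaMemcpy waits for any running kernel to finish before it copies. So if you launched a kernel just before the memcpy and then timed the memcpy, you were really timing the kernel + memcpy.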
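If you want both numbers separately, CUDA events are another way to do it, since they are timestamped on the GPU itself. Here is a minimal sketch; busyKernel, Timings and timeKernelAndCopy are names I made up for the example, and everything runs on the default stream.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel that does some arithmetic so it takes measurable time.
__global__ void busyKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = static_cast<float>(i);
        for (int k = 0; k < 1000; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Hypothetical result type: elapsed milliseconds for each phase.
struct Timings { float kernelMs; float copyMs; bool ok; };

// Times the kernel and the device-to-host copy separately. Expects n > 0.
Timings timeKernelAndCopy(int n)
{
    Timings t = {0.0f, 0.0f, false};
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* h = static_cast<float*>(malloc(bytes));
    float* d = nullptr;
    if (!h || cudaMalloc((void**)&d, bytes) != cudaSuccess) {
        free(h);
        return t;
    }

    cudaEvent_t start, afterKernel, afterCopy;
    cudaEventCreate(&start);
    cudaEventCreate(&afterKernel);
    cudaEventCreate(&afterCopy);

    // Events are queued in the same stream as the work, so each one is
    // timestamped when the GPU reaches it, not when the host call returns.
    cudaEventRecord(start);
    busyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(afterKernel);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(afterCopy);

    // Block the host until the last event has actually been reached.
    cudaEventSynchronize(afterCopy);

    t.ok = cudaEventElapsedTime(&t.kernelMs, start, afterKernel) == cudaSuccess
        && cudaEventElapsedTime(&t.copyMs, afterKernel, afterCopy) == cudaSuccess;

    cudaEventDestroy(start);
    cudaEventDestroy(afterKernel);
    cudaEventDestroy(afterCopy);
    cudaFree(d);
    free(h);
    return t;
}
```

With this, kernelMs covers only the kernel and copyMs covers only the transfer, so a slow kernel can no longer show up as a slow memcpy.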
I got it!
Thanks a bunch for your kind answer :)