I don’t understand why my time3 and time2 always give the same result. My kernel takes a long time before the result is ready for fetching, but shouldn’t cudaThreadSynchronize() block until the kernel call is done? Also, fetching from device memory to host memory should take a while, or at least a noticeable amount of time. Thanks.
Is there a typo in the memory transfer directions? It looks like it should be cudaMemcpyHostToDevice in the first memcpy and cudaMemcpyDeviceToHost in the second; only then are you actually timing the fetch. Right now, since nothing depends on the second memcpy, some optimization might be happening (wild guess). And you don’t need the second cudaThreadSynchronize(), since there would be no GPU threads still running at that point.
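For reference, here is a minimal, untested sketch of the timing pattern being discussed, since the original code isn’t quoted here. The names kernel_call, h_data, d_data, and N are placeholders, and the kernel body is just dummy work. Note the transfer directions: HostToDevice for the upload before the kernel, DeviceToHost for the fetch being timed. Also, cudaThreadSynchronize() is deprecated in newer toolkits, so the sketch uses cudaDeviceSynchronize(), which does the same thing.

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void kernel_call(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;            // placeholder work
}

static double seconds_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const int N = 1 << 24;
    float *h_data = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; ++i) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    auto t0 = std::chrono::steady_clock::now();

    // Upload: host -> device
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);
    double time1 = seconds_since(t0);

    // The kernel launch is asynchronous; block until the kernel has finished
    // so that time2 actually includes the kernel execution.
    kernel_call<<<(N + 255) / 256, 256>>>(d_data, N);
    cudaDeviceSynchronize();
    double time2 = seconds_since(t0);

    // Fetch: device -> host. cudaMemcpy blocks until the copy completes,
    // so no extra synchronize is needed afterwards.
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    double time3 = seconds_since(t0);

    printf("time1 = %.6f s, time2 = %.6f s, time3 = %.6f s\n", time1, time2, time3);

    cudaFree(d_data);
    free(h_data);
    return 0;
}

With the directions fixed, the gap between time2 and time3 should reflect the device-to-host copy; if the second memcpy copies in the wrong direction (or copies nothing useful), that gap can easily come out as essentially zero.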