Getting different time statistics for the same function: totally confused after seeing the results

Hi,

I am trying to implement a filter on CUDA and have developed a filter kernel function for it. After kernel execution I have to copy 32400 float values from device to host, and it takes ~2000 milliseconds.

If I perform the same copy before the kernel launch, it takes only ~4 milliseconds.

I am confused by these different statistics.

Can anybody help me to solve this problem?

I think this might be due to some overhead. If so, is it a limitation of the graphics card that it introduces such a large overhead?

Thanks in advance. :)

Do you call cudaThreadSynchronize() after running the kernel and before the memcpy? I guess not.

Kernel launches are asynchronous, meaning that after you invoke a kernel, control is passed back to your program immediately while the code on the GPU is still running. Any call to cudaMemcpy() causes implicit synchronization: the function waits until the kernel completes.

So my guess is that you’re measuring kernel execution time, not the memory copy. Call cudaThreadSynchronize() right after the kernel invocation and your timing should be OK.
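To make the difference concrete, here is a minimal sketch of that advice. The kernel name (filterKernel), launch configuration, and the trivial filter body are placeholders, not the poster's actual code; only the element count (32400 floats) comes from the post. cudaThreadSynchronize() was the correct call in this CUDA era (it was deprecated in favor of cudaDeviceSynchronize() much later).

```cuda
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's filter kernel.
__global__ void filterKernel(float *d_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_out[i] *= 0.5f;
}

int main() {
    const int N = 32400;                   // float count from the post
    float *d_data;
    float *h_data = new float[N];
    cudaMalloc((void **)&d_data, N * sizeof(float));

    filterKernel<<<(N + 255) / 256, 256>>>(d_data, N);
    cudaThreadSynchronize();               // wait for the kernel to finish...

    clock_t t0 = clock();                  // ...so this measures ONLY the copy
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    clock_t t1 = clock();

    printf("memcpy took %.3f ms\n", 1000.0 * (t1 - t0) / CLOCKS_PER_SEC);
    cudaFree(d_data);
    delete[] h_data;
    return 0;
}
```

Without the cudaThreadSynchronize() line, the host timer would start before the kernel has finished, so the cudaMemcpy() call would block until the kernel completes and its time would be charged to the copy.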

In CUDA 1.1, events provide an accurate and portable mechanism for timing GPU-related operations. cu(da)EventRecord records a timestamp asynchronously: the call returns right away, and the hardware records the timestamp once all preceding operations have completed.

For timing purposes, specifying the NULL stream is best.

No synchronization is required until the app wants to call cu(da)EventElapsedTime to compute the difference between two recorded events’ timestamps. Before calling cu(da)EventElapsedTime, the app must call cu(da)EventSynchronize on the events or cuCtxSynchronize/cudaThreadSynchronize to synchronize with the GPU.

For overall wall clock times, it may be preferable to use host based timing mechanisms; but if you want to measure how long the GPU is spending memcpy’ing or in a particular kernel, events are a good option.
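Putting the event-timing steps above together, here is a sketch using the runtime API (the cuda* spellings of the cu(da)* calls named above). The kernel and its launch configuration are illustrative placeholders; the pattern of record, record, synchronize, then elapsed-time is the part the description specifies.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel just so there is something to time.
__global__ void dummyKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int N = 32400;
    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                  // 0 = NULL stream, as advised
    dummyKernel<<<(N + 255) / 256, 256>>>(d_data, N);
    cudaEventRecord(stop, 0);

    cudaEventSynchronize(stop);                 // required before reading the timestamps
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // GPU time between the two events, in ms
    printf("kernel took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Because the timestamps are recorded by the hardware, this measures only the GPU work between the two events and is unaffected by launch latency on the host side.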

Thanks for your input.

I have now added cudaThreadSynchronize() and am getting far better results. :)