Getting different time statistics for the same function: totally confused after seeing the results

Hi,

I am trying to implement a filter on CUDA and have developed a filter kernel function for it. After kernel execution I have to copy 32400 float values from device to host, and it takes ~2000 milliseconds.

If I perform the same copy before the kernel launch, it takes only ~4 milliseconds.

I am confused by these different statistics.

Can anybody help me to solve this problem?

I think this might be due to some overhead. If so, is it a limitation of the graphics card that it introduces such a large overhead?

Thanks in advance. :)

Do you call cudaThreadSynchronize() after running the kernel and before the memcpy? I guess not.

Kernel launches are asynchronous, meaning that after you invoke a kernel, control is passed back to your program immediately while the code on the GPU is still running. Any call to cudaMemcpy() causes implicit synchronization: the function waits until the kernel completes.

So my guess is that you’re measuring kernel execution time, not the memory copy. Call cudaThreadSynchronize() right after the kernel invocation and your timing should be OK.
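To make the difference concrete, here is a minimal sketch of that advice. The kernel name (filterKernel), launch configuration, and the trivial filter body are placeholders, not the poster's actual code; only the element count (32400 floats) comes from the post. cudaThreadSynchronize() was the correct call in this CUDA era (it was deprecated in favor of cudaDeviceSynchronize() much later).

```cuda
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's filter kernel.
__global__ void filterKernel(float *d_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_out[i] *= 0.5f;
}

int main() {
    const int N = 32400;                   // float count from the post
    float *d_data;
    float *h_data = new float[N];
    cudaMalloc((void **)&d_data, N * sizeof(float));

    filterKernel<<<(N + 255) / 256, 256>>>(d_data, N);
    cudaThreadSynchronize();               // wait for the kernel to finish...

    clock_t t0 = clock();                  // ...so this measures ONLY the copy
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    clock_t t1 = clock();

    printf("memcpy took %.3f ms\n", 1000.0 * (t1 - t0) / CLOCKS_PER_SEC);
    cudaFree(d_data);
    delete[] h_data;
    return 0;
}
```

Without the cudaThreadSynchronize() line, the host timer would start before the kernel has finished, so the cudaMemcpy() call would block until the kernel completes and its time would be charged to the copy.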

In CUDA 1.1, events provide an accurate and portable mechanism for timing GPU-related operations. cu(da)EventRecord records a timestamp asynchronously: the call returns right away, and the hardware records the timestamp once all preceding operations have completed.

For timing purposes, specifying the NULL stream is best.

No synchronization is required until the app wants to call cu(da)EventElapsedTime to compute the difference between two recorded events’ timestamps. Before calling cu(da)EventElapsedTime, the app must call cu(da)EventSynchronize on the events or cuCtxSynchronize/cudaThreadSynchronize to synchronize with the GPU.

For overall wall clock times, it may be preferable to use host based timing mechanisms; but if you want to measure how long the GPU is spending memcpy’ing or in a particular kernel, events are a good option.
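Putting the event-timing steps above together, here is a sketch using the runtime API (the cuda* spellings of the cu(da)* calls named above). The kernel and its launch configuration are illustrative placeholders; the pattern of record, record, synchronize, then elapsed-time is the part the description specifies.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel just so there is something to time.
__global__ void dummyKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int N = 32400;
    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                  // 0 = NULL stream, as advised
    dummyKernel<<<(N + 255) / 256, 256>>>(d_data, N);
    cudaEventRecord(stop, 0);

    cudaEventSynchronize(stop);                 // required before reading the timestamps
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // GPU time between the two events, in ms
    printf("kernel took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Because the timestamps are recorded by the hardware, this measures only the GPU work between the two events and is unaffected by launch latency on the host side.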

Thanks for your input.

I have now added cudaThreadSynchronize() and am getting far better results. :)