How to evaluate CUDA performance: how can I know the program is optimized?

Hi, everyone,

I have recently finished a small image-processing program using CUDA, but I've run into a problem. I can't find a way to evaluate the performance of my project, which means I don't know whether this program is already well optimized or not.

Any suggestions?
Thank you very much!

Have it process an image several thousand times in a tight loop, measure the run time. Compare that against the CPU’s run time using the same algorithm. Some CUDA SDK samples show you how to use the timers correctly. The drawback: If you don’t have an optimized CPU version of the algorithm handy, you may have to create it first.

If you want to compare different versions of your kernels against each other, this can also be done with the timers.

You can also perform some theoretical computations, factoring in either the available memory bandwidth of your card (if your application is bandwidth limited) or the available GFLOPS of the card (if your application is computationally limited). Then, based on the theoretical run times, you can calculate by what factor your implementation runs slower than the optimum. Don't ever expect to get close to theoretical performance though ;-)

Thank you, I will try it later!

One important thing when using these timers:

Don’t just wrap the timer around the CUDA kernel invocation. I did that and it was a big mistake.

The kernel invocation is asynchronous, and the kernel keeps computing while the CPU is already continuing in the program. So in order to really measure the run time of the kernel you need to include the cudaMemcpy from device memory to host memory before stopping your timer. The memcpy blocks until the kernel has finished computing before it copies the values.

Maybe there are other ways to wait until the kernel is done, I haven’t studied the manual yet to that kind of detail.

cudaThreadSynchronize() is what you are looking for.
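To make that concrete, a minimal sketch of CPU-side timing with explicit synchronization; `myKernel`, `d_img`, and the `cpuTimer()` helper are placeholders for your own kernel, device buffer, and favorite CPU timer:

```cuda
cudaThreadSynchronize();              // drain any pending GPU work first
double t0 = cpuTimer();               // hypothetical CPU timer helper

myKernel<<<grid, block>>>(d_img, n);  // launch is asynchronous

cudaThreadSynchronize();              // block until the kernel has finished
double t1 = cpuTimer();
printf("kernel time: %.3f ms\n", (t1 - t0) * 1000.0);
```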

If you want to time the kernel accurately, use CUDA events. For example, look at the simpleStreams sample in the SDK to see how to use events for timing. Event API is described in the Programming Guide. Note that events are recorded on the GPU, so you’ll be timing only GPU execution. The nice benefit is that clock resolution is the period of the GPU shader clock - you should get reliable timings even from a single kernel launch.
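The event-based timing described above looks roughly like this (`myKernel` and `d_img` are placeholders; see the simpleStreams sample for a complete version):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);             // recorded on the GPU, in stream 0
myKernel<<<grid, block>>>(d_img, n);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);            // wait until the stop event is reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```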

If you want to time operations including CPU involvement (like driver overhead), you should use your favorite CPU timer. Just make sure you understand the timer resolution. Also, as seibert pointed out, make sure to call cudaThreadSynchronize() before starting and then again before stopping the timer.

Do not ever use blocking CUDA calls (like memcopies) to achieve synchronicity - that will change your timings terribly.


Thank you, it is very helpful!!