How to evaluate CUDA performance: how can I know the program is optimized?

babysun · July 16, 2008, 4:07am

Hi, everyone

I recently finished a small image-processing program using CUDA, but I have run into a problem.
I cannot find a way to evaluate the performance of my project, which means I don’t know whether the program is already well optimized or not.

Any suggestions?
Thank you very much!

cbuchner1 · July 16, 2008, 8:15am

Have it process an image several thousand times in a tight loop and measure the run time. Compare that against the CPU’s run time using the same algorithm. Some CUDA SDK samples show you how to use the timers correctly. The drawback: if you don’t have an optimized CPU version of the algorithm handy, you may have to write one first.
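A minimal sketch of the CPU side of that comparison, written as host-only code that also compiles in a .cu file. The function names and the per-pixel scaling operation are hypothetical stand-ins for whatever the real image-processing algorithm does; only the pattern (repeat many times, divide by the repeat count) comes from the advice above.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical CPU reference: the same per-pixel operation the GPU kernel performs.
void scale_pixels_cpu(std::vector<float> &img, float factor) {
    for (float &p : img) p *= factor;
}

// Runs the CPU reference `repeats` times and returns the average time per pass in ms.
double average_cpu_ms(std::vector<float> &img, int repeats) {
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        scale_pixels_cpu(img, 1.0001f);
    }
    auto t1 = std::chrono::steady_clock::now();
    double total_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    return total_ms / repeats;
}

int main() {
    std::vector<float> img(1024 * 1024, 1.0f);  // assumed 1024x1024 single-channel image
    double ms = average_cpu_ms(img, 1000);
    // Printing a pixel keeps the compiler from discarding the work as dead code.
    printf("CPU average: %f ms per pass (pixel 0 = %f)\n", ms, img[0]);
    return 0;
}
```

The GPU number to compare against should come from a loop of the same length over the same image size, so that both averages cover identical work.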

If you want to compare different versions of your kernels against each other, this can also be done with the timers.

You can also do some theoretical calculations, factoring in either the available memory bandwidth of your card (if your application is bandwidth limited) or its available GFLOPS (if your application is compute limited). From the theoretical run time you can then work out by what factor your implementation is slower than the optimum. Don’t ever expect to get close to theoretical performance though ;-)
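The arithmetic behind that estimate can be sketched as follows. This is host-only code; the helper names are made up for illustration, and the peak bandwidth, peak GFLOPS and measured time in main() are placeholders, not figures for any particular card. Substitute the numbers from your card's specification and your own measurement.

```cuda
#include <cstdio>

// Lower bound on run time (ms) if the kernel is limited by memory bandwidth:
// total bytes read and written, divided by peak bandwidth.
double min_time_ms_bandwidth(double bytes_moved, double peak_gb_per_s) {
    return bytes_moved / (peak_gb_per_s * 1e9) * 1e3;
}

// Lower bound on run time (ms) if the kernel is limited by arithmetic throughput.
double min_time_ms_compute(double flop_count, double peak_gflops) {
    return flop_count / (peak_gflops * 1e9) * 1e3;
}

// Factor by which the measured time exceeds the theoretical bound.
double slowdown_factor(double measured_ms, double theoretical_ms) {
    return measured_ms / theoretical_ms;
}

int main() {
    // Assumed workload: a 1024x1024 float image, each pixel read once and written once.
    double pixels = 1024.0 * 1024.0;
    double bytes = 2.0 * pixels * sizeof(float);
    double flops = 1.0 * pixels;       // assumed one floating-point operation per pixel

    double peak_bw_gb_s = 80.0;        // placeholder: take this from the card's spec
    double peak_gflops = 300.0;        // placeholder: take this from the card's spec
    double measured_ms = 0.25;         // placeholder: take this from your timer

    double bw_bound = min_time_ms_bandwidth(bytes, peak_bw_gb_s);
    double fl_bound = min_time_ms_compute(flops, peak_gflops);
    // The larger of the two bounds is the one that actually limits the kernel.
    double bound = bw_bound > fl_bound ? bw_bound : fl_bound;

    printf("bandwidth bound: %f ms, compute bound: %f ms\n", bw_bound, fl_bound);
    printf("measured is %.2fx the theoretical minimum\n",
           slowdown_factor(measured_ms, bound));
    return 0;
}
```

Whichever bound is larger tells you whether the kernel is bandwidth limited or compute limited for that workload.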

babysun · July 17, 2008, 2:45am

Thank you, I will try it later!

cbuchner1 · July 17, 2008, 12:42pm

One important thing when using these timers:

Don’t just wrap the timer around the CUDA kernel invocation. I did that and it was a big mistake.

The kernel invocation is asynchronous, and the kernel keeps computing while the CPU is already continuing in the program. So in order to really measure the run time of the kernel you need to include the CUDA MemCopy operation from device memory to host memory before stopping your timer. The memcopy actually blocks until the kernel has finished computing before it copies the values.

Maybe there are other ways to wait until the kernel is done; I haven’t studied the manual in that much detail yet.

seibert · July 17, 2008, 1:43pm

cudaThreadSynchronize() is what you are looking for.

paulius · July 21, 2008, 9:36pm

If you want to time the kernel accurately, use CUDA events. For example, look at the simpleStreams sample in the SDK to see how to use events for timing. The event API is described in the Programming Guide. Note that events are recorded on the GPU, so you’ll be timing only GPU execution. The nice benefit is that the clock resolution is the period of the GPU shader clock, so you should get reliable timings even from a single kernel launch.
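A minimal sketch of event-based timing with the CUDA runtime API. The kernel scale_pixels and the helper time_kernel_ms are hypothetical stand-ins, and the warm-up launch is an added assumption (it keeps one-time startup cost out of the measurement); it is not part of the advice above. This is not the simpleStreams sample itself, only the same event calls in a smaller setting.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: scales each pixel in place.
__global__ void scale_pixels(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the total GPU-side time in ms for `launches` back-to-back kernel launches.
float time_kernel_ms(float *d_data, int n, int launches) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Warm-up launch (assumption: excludes first-launch overhead from the timing).
    scale_pixels<<<blocks, threads>>>(d_data, n, 1.0f);

    cudaEventRecord(start, 0);              // recorded on the GPU, in stream 0
    for (int i = 0; i < launches; ++i) {
        scale_pixels<<<blocks, threads>>>(d_data, n, 1.0f);
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);             // wait until the stop event has happened

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop); // elapsed time between the two events, in ms

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 20;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    const int launches = 100;
    float total_ms = time_kernel_ms(d_data, n, launches);
    printf("average kernel time: %f ms\n", total_ms / launches);

    cudaFree(d_data);
    return 0;
}
```

Because both events are recorded in the same stream as the kernel, the elapsed time covers GPU execution only and no host-side copy is needed to force completion.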

If you want to time operations including CPU involvement (like driver overhead), you should use your favorite CPU timer. Just make sure you understand the timer resolution. Also, as seibert pointed out, make sure to call cudaThreadSynchronize() before starting and then again before stopping the timer.
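A sketch of the CPU-timer variant, synchronizing before starting and again before stopping the clock. Two assumptions to flag: it calls cudaDeviceSynchronize(), which replaced the since-deprecated cudaThreadSynchronize() in later CUDA releases, and it uses std::chrono::steady_clock as the "favorite CPU timer". The kernel and helper names are hypothetical.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: scales each pixel in place.
__global__ void scale_pixels(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns wall-clock time in ms for `launches` launches, including launch/driver overhead.
double time_with_cpu_timer_ms(float *d_data, int n, int launches) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaDeviceSynchronize();  // make sure no earlier GPU work is still running
    auto t0 = std::chrono::steady_clock::now();

    for (int i = 0; i < launches; ++i) {
        scale_pixels<<<blocks, threads>>>(d_data, n, 1.0f);  // returns immediately
    }

    cudaDeviceSynchronize();  // wait for all launched kernels before stopping the clock
    auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int n = 1 << 20;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    const int launches = 100;
    double total_ms = time_with_cpu_timer_ms(d_data, n, launches);
    printf("average time per launch (CPU timer): %f ms\n", total_ms / launches);

    cudaFree(d_data);
    return 0;
}
```

Without the second synchronization the timer would stop while kernels are still running, since the launches are asynchronous.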

Do not ever use blocking CUDA calls (like memcopies) to achieve synchronization - that will change your timings terribly.

Paulius

babysun · July 24, 2008, 6:35am

Thank you, it is very helpful!!

