Numerical estimation of run time

gpugpu · June 11, 2009, 3:05am

Hi All,

How can I numerically estimate the run time of a CUDA kernel? What are the important points to consider?
For example, take a simple reduction kernel. Can someone provide an example here?
Is the estimate usually presented in terms of cycles?

I know GFLOPS is a measure of throughput, but what if I don’t have any multiply-add operations waiting on a global load in my kernel?

Thanks.

MisterAnderson42 · June 11, 2009, 11:44am

90% of the time (or more), kernels are limited by the bandwidth to/from device memory. GFLOPS doesn’t mean anything in that situation.

To estimate the time for a simple reduction kernel (or any other bandwidth-limited one), simply total up the number of bytes that you must read from and write to global memory (including texture fetches). Then divide by the device-to-device bandwidth you get from bandwidthTest (don’t forget unit conversion factors!) and you will have a pretty good estimate of your ideal kernel run time.
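As a rough worked example of that arithmetic, here is a small Python sketch. The function name `estimate_kernel_time_ms`, the input size, the block count and the 70 GB/s bandwidth figure are all illustrative assumptions, not numbers from this thread; substitute the device-to-device figure that bandwidthTest reports on your card, and check which units it prints.

```python
def estimate_kernel_time_ms(bytes_read, bytes_written, bandwidth_gb_per_s):
    """Ideal run time in milliseconds for a bandwidth-limited kernel.

    Assumes 1 GB = 1e9 bytes; adjust if your bandwidth figure uses
    a different convention.
    """
    total_bytes = bytes_read + bytes_written
    bytes_per_second = bandwidth_gb_per_s * 1e9
    return total_bytes / bytes_per_second * 1e3


# Hypothetical reduction: sum 4M floats, writing one partial sum per block.
n_elements = 4 * 1024 * 1024
n_blocks = 256                     # assumed launch configuration
bytes_read = n_elements * 4        # each 4-byte float is read once
bytes_written = n_blocks * 4       # one 4-byte float written per block

# 70.0 GB/s is a placeholder for your measured device-to-device bandwidth.
estimate_ms = estimate_kernel_time_ms(bytes_read, bytes_written, 70.0)
print(f"ideal run time: {estimate_ms:.3f} ms")
```

With these assumed numbers the estimate comes out at roughly 0.24 ms. Note that the writes are negligible here, so the estimate is dominated by the single pass over the input.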

Do note that non-coalesced loads/stores or non-optimal texture fetch patterns can make the actual run time significantly longer than an estimate obtained in this way.
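One way to see how far a real kernel falls from the ideal is to invert the same calculation: take a measured run time, compute the bandwidth actually achieved, and compare it with the bandwidthTest figure. The sketch below assumes hypothetical values throughout (a 0.8 ms measured time and a 70 GB/s peak), and the helper names are made up for illustration.

```python
def achieved_bandwidth_gb_per_s(total_bytes, measured_ms):
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes) from a measured time."""
    seconds = measured_ms * 1e-3
    return total_bytes / seconds / 1e9


def fraction_of_peak(total_bytes, measured_ms, peak_gb_per_s):
    """Achieved bandwidth as a fraction of the measured peak."""
    return achieved_bandwidth_gb_per_s(total_bytes, measured_ms) / peak_gb_per_s


# Hypothetical numbers: 16 MiB moved, kernel timed at 0.8 ms,
# and 70 GB/s standing in for the bandwidthTest result.
total_bytes = 16 * 1024 * 1024
measured_ms = 0.8
peak_gb_per_s = 70.0

efficiency = fraction_of_peak(total_bytes, measured_ms, peak_gb_per_s)
print(f"achieved {efficiency:.0%} of peak bandwidth")
```

A fraction well below 1 suggests the access pattern (for example, non-coalesced loads) is worth investigating before anything else.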
