How can I numerically estimate the run time of a CUDA kernel, and what are the important points to consider?
For example, take a simple reduction kernel. Can someone provide a worked example?
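To make the question concrete, here is the kind of back-of-envelope estimate I have in mind for a memory-bound reduction: the kernel must read roughly 4·N bytes from global memory, so a lower bound on run time would be bytes moved divided by peak bandwidth. This is just a sketch; the 900 GB/s bandwidth figure is a placeholder (roughly a V100), not a measured number:

```python
# Back-of-envelope estimate for a memory-bound sum reduction:
# reducing N floats requires reading ~4*N bytes from global memory,
# so a lower bound on run time is bytes_moved / peak_bandwidth.
# The 900 GB/s figure is a placeholder, not from a real spec sheet.

def estimate_reduction_time_us(n_elements: int,
                               bytes_per_element: int = 4,
                               peak_bandwidth_gbs: float = 900.0) -> float:
    """Return an estimated kernel run time in microseconds."""
    bytes_moved = n_elements * bytes_per_element
    seconds = bytes_moved / (peak_bandwidth_gbs * 1e9)
    return seconds * 1e6  # convert to microseconds

# 64M floats -> ~268 MB of reads -> roughly 300 us at 900 GB/s
print(f"{estimate_reduction_time_us(64 * 1024**2):.1f} us")
```

Is this the right way to think about it, or am I missing terms (kernel launch overhead, non-coalesced access, the extra passes a tree reduction makes)?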
Is such an estimate usually presented in terms of cycles?
I know GFLOPS is a measure of throughput, but what if my kernel has no multiply-add operations waiting on a global load?
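My rough understanding of why GFLOPS might not be the right metric here: in a roofline-style model, attainable throughput is min(peak FLOP/s, arithmetic intensity × peak bandwidth), so a kernel with very few FLOPs per byte is capped by bandwidth long before the FLOP rate matters. A sketch of that check, with made-up peak numbers (7 TFLOP/s and 900 GB/s are assumptions, not a real device's specs):

```python
# Roofline-style check: compare the kernel's arithmetic intensity
# (FLOPs per byte of global traffic) against the machine balance point
# (peak FLOP/s divided by peak bandwidth). Below that point the kernel
# is memory-bound and GFLOPS is the wrong yardstick.
# Both peak numbers here are hypothetical placeholders.

def attainable_gflops(intensity_flops_per_byte: float,
                      peak_gflops: float = 7000.0,
                      peak_bandwidth_gbs: float = 900.0) -> float:
    """Attainable GFLOPS under the roofline model."""
    return min(peak_gflops, intensity_flops_per_byte * peak_bandwidth_gbs)

# A sum reduction does ~1 add per 4-byte load: intensity = 0.25 FLOP/byte.
print(attainable_gflops(0.25))  # 225.0 -> far below 7000, so memory-bound
```

If that reasoning is right, should I be estimating run time purely from memory traffic and ignoring the FLOP count entirely?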