How can I numerically estimate the run time of a CUDA kernel, and what are the important points to consider?
For example, take a simple reduction kernel. Can someone provide a worked example?
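To make the question concrete, here is the kind of back-of-envelope estimate I have in mind for a memory-bound reduction: the kernel must read roughly 4·N bytes from global memory, so a lower bound on run time would be bytes moved divided by peak bandwidth. This is just a sketch; the 900 GB/s bandwidth figure is a placeholder (roughly a V100), not a measured number:

```python
# Back-of-envelope estimate for a memory-bound sum reduction:
# reducing N floats requires reading ~4*N bytes from global memory,
# so a lower bound on run time is bytes_moved / peak_bandwidth.
# The 900 GB/s figure is a placeholder, not from a real spec sheet.

def estimate_reduction_time_us(n_elements: int,
                               bytes_per_element: int = 4,
                               peak_bandwidth_gbs: float = 900.0) -> float:
    """Return an estimated kernel run time in microseconds."""
    bytes_moved = n_elements * bytes_per_element
    seconds = bytes_moved / (peak_bandwidth_gbs * 1e9)
    return seconds * 1e6  # convert to microseconds

# 64M floats -> ~268 MB of reads -> roughly 300 us at 900 GB/s
print(f"{estimate_reduction_time_us(64 * 1024**2):.1f} us")
```

Is this the right way to think about it, or am I missing terms (kernel launch overhead, non-coalesced access, the extra passes a tree reduction makes)?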
Is such an estimate usually presented in terms of cycles?
I know GFLOPS is a measure of throughput, but what if my kernel has no multiply-add operations waiting on a global load?
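My rough understanding of why GFLOPS might not be the right metric here: in a roofline-style model, attainable throughput is min(peak FLOP/s, arithmetic intensity × peak bandwidth), so a kernel with very few FLOPs per byte is capped by bandwidth long before the FLOP rate matters. A sketch of that check, with made-up peak numbers (7 TFLOP/s and 900 GB/s are assumptions, not a real device's specs):

```python
# Roofline-style check: compare the kernel's arithmetic intensity
# (FLOPs per byte of global traffic) against the machine balance point
# (peak FLOP/s divided by peak bandwidth). Below that point the kernel
# is memory-bound and GFLOPS is the wrong yardstick.
# Both peak numbers here are hypothetical placeholders.

def attainable_gflops(intensity_flops_per_byte: float,
                      peak_gflops: float = 7000.0,
                      peak_bandwidth_gbs: float = 900.0) -> float:
    """Attainable GFLOPS under the roofline model."""
    return min(peak_gflops, intensity_flops_per_byte * peak_bandwidth_gbs)

# A sum reduction does ~1 add per 4-byte load: intensity = 0.25 FLOP/byte.
print(attainable_gflops(0.25))  # 225.0 -> far below 7000, so memory-bound
```

If that reasoning is right, should I be estimating run time purely from memory traffic and ignoring the FLOP count entirely?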