What is the correct method to analytically estimate kernel performance in FLOPS on a CUDA GPU? My goal is not to measure the peak performance of the GPU, but to get a correct FLOPS estimate for this algorithm and compare it with the CPU implementation.
Analyzing FLOPS (counting each floating-point operation as 1 FLOP) is useful for determining whether you are utilizing the full compute capability of the device. But you should also look at the memory throughput you achieve, in GB/s, to see which of the two limits you have reached: peak arithmetic rate or peak memory bandwidth.
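As a rough illustration of that comparison, here is a minimal Python sketch. It assumes you have already counted the FLOPs and bytes moved per element by hand and timed the kernel separately (for example with CUDA events). The SAXPY kernel, the 1 ms run time, and the peak figures are made-up placeholders, not measurements from any real device, and the helper names `achieved_rates` and `limiting_resource` are hypothetical.

```python
def achieved_rates(flops_per_elem, bytes_per_elem, n_elems, seconds):
    """Return (GFLOP/s, GB/s) achieved by one kernel run."""
    gflops = flops_per_elem * n_elems / seconds / 1e9
    gbytes = bytes_per_elem * n_elems / seconds / 1e9
    return gflops, gbytes


def limiting_resource(gflops, gbytes, peak_gflops, peak_gbytes):
    """Return which peak the kernel is closer to, plus both utilization fractions."""
    compute_frac = gflops / peak_gflops
    memory_frac = gbytes / peak_gbytes
    bound = "compute" if compute_frac >= memory_frac else "memory"
    return bound, compute_frac, memory_frac


# Example kernel: SAXPY, y[i] = a * x[i] + y[i], in float32.
# Per element: 1 multiply + 1 add = 2 FLOPs;
# read x, read y, write y = 3 * 4 bytes = 12 bytes.
n = 1 << 24
elapsed_s = 1.0e-3       # ASSUMPTION: hypothetical measured kernel time
peak_gflops = 10_000.0   # ASSUMPTION: placeholder, take the real value from the device spec
peak_gbytes = 500.0      # ASSUMPTION: placeholder, take the real value from the device spec

gflops, gbytes = achieved_rates(2, 12, n, elapsed_s)
bound, compute_frac, memory_frac = limiting_resource(
    gflops, gbytes, peak_gflops, peak_gbytes
)
print(f"{gflops:.1f} GFLOP/s ({compute_frac:.1%} of peak), "
      f"{gbytes:.1f} GB/s ({memory_frac:.1%} of peak), closer to the {bound} limit")
```

With these placeholder numbers the kernel uses a far larger share of the memory bandwidth than of the arithmetic peak, which is what you would expect from a streaming kernel like SAXPY. The same operation and byte counts can be applied to the CPU timing to compare the two implementations on equal terms.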