Estimating performance in FLOPS: what's the correct way to do it?

What is the correct method to analytically estimate kernel performance in FLOPS on a CUDA GPU? My goal is not to measure the maximum performance of the GPU, but to get a correct estimate of FLOPS for this algorithm and compare it with the CPU implementation.

Is there some reference doc on this topic maybe?

Thanks in advance!

IMO, the term “FLOPS” by itself is not a correct way to measure algorithm performance, since there are some problems:

  1. How many FLOPs should operations like sin, cos, exp, etc. be counted as?
  2. How to account for the fact that some CPU operations take a different number of clock cycles depending on their arguments?
  3. How to account for global memory latency and read-after-write dependencies, which seriously affect performance?

Thus, IMO the best way to compare is to measure the execution time of both the CPU and GPU implementations.
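A minimal sketch of such a timing comparison, in Python for brevity. The helper names `best_time` and `speedup` are made up for illustration and are not from any library. One thing to keep in mind: CUDA kernel launches are asynchronous, so whatever you time on the GPU side has to block until the kernel has actually finished (for example by synchronizing the device or using CUDA events), otherwise you only measure the launch overhead.

```python
import time

def best_time(fn, repeats=5):
    """Return the smallest wall-clock time (in seconds) over several runs of fn().

    Taking the minimum reduces noise from other processes and warm-up effects.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # for a GPU run, fn must not return before the kernel has finished
        best = min(best, time.perf_counter() - start)
    return best

def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run is than the CPU run."""
    return cpu_seconds / gpu_seconds
```

Usage would look like `speedup(best_time(run_cpu), best_time(run_gpu))`, where `run_cpu` and `run_gpu` are placeholders for your own two implementations working on the same input.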

BTW, I’ve done some experiments with my kernel timings;
maybe they will give you some ideas for your investigations.

Analyzing FLOPS (counting each operation as 1 FLOP) is useful for determining whether you are utilizing the full device capabilities. But one should then also look at how many GB/s you are achieving, to see whether you have reached either of the two limits.
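As a rough sketch of that bookkeeping (Python, just for the arithmetic): count the floating-point operations and the bytes moved to and from global memory, divide both by the measured kernel time, and compare each against the device's peak. The kernel, the measured time, and the peak figures below are all made-up placeholders, not numbers for any real GPU; substitute your own operation counts, timings, and the specs of your card.

```python
def achieved_rates(flop_count, bytes_moved, seconds):
    """Return (GFLOP/s, GB/s) achieved by a kernel, counting each operation as 1 FLOP."""
    gflops = flop_count / seconds / 1e9
    gbps = bytes_moved / seconds / 1e9
    return gflops, gbps

def closer_limit(gflops, peak_gflops, gbps, peak_gbps):
    """Say which limit the kernel is closer to: 'compute' or 'memory'."""
    compute_fraction = gflops / peak_gflops
    memory_fraction = gbps / peak_gbps
    return "compute" if compute_fraction > memory_fraction else "memory"

# Hypothetical kernel: y[i] = a * x[i] + y[i] over n single-precision floats.
n = 1 << 24
flop_count = 2 * n        # one multiply and one add per element
bytes_moved = 3 * 4 * n   # read x, read y, write y; 4 bytes each
seconds = 2.0e-3          # made-up measured kernel time

# Made-up peak figures; take the real ones from your GPU's specifications.
peak_gflops = 1000.0
peak_gbps = 150.0

gflops, gbps = achieved_rates(flop_count, bytes_moved, seconds)
print(f"{gflops:.1f} GFLOP/s, {gbps:.1f} GB/s, "
      f"closer to the {closer_limit(gflops, peak_gflops, gbps, peak_gbps)} limit")
```

With these placeholder numbers the kernel uses only a small fraction of the arithmetic peak but a large fraction of the memory bandwidth, which is the typical situation where a low FLOPS figure does not mean the kernel is badly written.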