IMO, the term “FLOPS” by itself is not the right way to measure algorithm performance, since it raises several problems:
- How many FLOPs should operations like sin, cos, exp etc. be counted as?
- What to do with the fact that some CPU operations can take a different number of clock cycles depending on their arguments?
- What to do about global memory latency and read-after-write dependencies, which seriously affect performance?
Thus, IMO the best way to compare is to measure the wall-clock time of both the CPU and GPU implementations.
BTW, I’ve done some experiments with my kernel timings; maybe they’ll give you some ideas for your investigation.
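In case it helps, here is a minimal sketch of how I time both sides: CUDA events for the kernel and a host clock for the CPU loop. The kernel name (myKernel), the problem size N, and the actual work done per element are just placeholders for whatever you are benchmarking.

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel -- substitute whatever you are actually benchmarking.
__global__ void myKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int N = 1 << 20;

    // Prepare identical input for both the CPU and GPU runs.
    float* h_data = new float[N];
    for (int i = 0; i < N; ++i)
        h_data[i] = 1.0f;

    float* d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    // GPU timing with CUDA events (measures device time only, no transfers).
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<(N + 255) / 256, 256>>>(d_data, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);
    printf("GPU kernel time: %.3f ms\n", gpuMs);

    // CPU reference timing with a host clock, same work on the same data.
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; ++i)
        h_data[i] = h_data[i] * 2.0f + 1.0f;
    auto t1 = std::chrono::high_resolution_clock::now();
    double cpuMs = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("CPU loop time:   %.3f ms\n", cpuMs);

    delete[] h_data;
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

In practice I also launch the kernel once or twice beforehand to warm up the driver, average over several runs, and time host-to-device transfers separately when they matter for the comparison.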