How can we count FLOPs in a CUDA kernel? Does it have to be done from the PTX code?
A second question: how do I find the actual GFLOPS achieved by the code, as opposed to the theoretical upper limit?
Generally, the FLOPs you count to measure performance are “algorithmic” FLOPs (operations inherent in the algorithm, not in the implementation), divided by elapsed time to give a FLOPS rate. Most of the time these FLOPs do match what actually executes, since the majority of “utility” instructions (address/index computation, etc.) are integer operations.
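To make that concrete, here is a minimal sketch (in Python, just for the arithmetic; the matrix sizes and elapsed time are hypothetical values, and in practice the time would come from timing the kernel with CUDA events): for an M x K by K x N matrix multiply, the algorithmic count is 2*M*N*K operations, one multiply and one add per inner-product term.

```python
# Sketch: "algorithmic" GFLOPS for an M x K by K x N matrix multiply.
# elapsed_seconds would normally come from timing the kernel
# (e.g. with CUDA events); here it is a made-up value.

def matmul_gflops(M, N, K, elapsed_seconds):
    # 2*M*N*K: one multiply and one add for each of the M*N*K
    # inner-product terms, regardless of how the kernel implements it.
    flops = 2.0 * M * N * K
    return flops / elapsed_seconds / 1e9

# Hypothetical run: a 4096^3 multiply that took 25 ms.
print(matmul_gflops(4096, 4096, 4096, 0.025))  # roughly 5497.6 GFLOPS
```

Note that the count stays 2*M*N*K even if the kernel performs extra work (padding, recomputation); that is what “algorithmic, not implementation” means here.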
The concept of “algorithmic FLOPs” only really applies to the most elemental algorithms, things like multiplying matrices and computing FFTs. Most real-world algorithms don’t have an irreducible minimum of “true work” that must be carried out; they’re flexible.
black_ij, this question has been discussed to death; search around. You can’t calculate FLOPs “magically”; you just have to understand your code and know how many operations it will execute. You can also look at the instruction counter in the visual profiler, though this counts all instructions (including integer ones). Honestly, that figure is often more useful anyway.
Finally, the concept of FLOPS is sort of irrelevant. First, because most algorithms are limited by memory bandwidth, not arithmetic. Second, because what matters is how the app performs on a GPU versus a CPU. E.g., say you get “10 GFLOPS” from your CUDA code. Ok, then what? Well, if a CPU can only do 0.1 GFLOPS even after being optimized to death, then you’ve actually done really well.
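On the bandwidth point: a common sanity check is a roofline-style estimate comparing a kernel’s arithmetic intensity (FLOPs per byte of memory traffic) against the GPU’s compute-to-bandwidth ratio. Below that ratio, the kernel is bandwidth-bound no matter how many FLOPS it nominally performs. A sketch follows (the peak numbers are hypothetical placeholders, not the specs of any particular card):

```python
# Roofline-style sanity check (sketch). Peak figures are hypothetical
# placeholders, not the specs of any real GPU.
PEAK_GFLOPS = 1000.0   # hypothetical peak compute rate, GFLOP/s
PEAK_GBYTES = 150.0    # hypothetical peak memory bandwidth, GB/s

def attainable_gflops(flops_per_elem, bytes_per_elem):
    # Arithmetic intensity: FLOPs per byte of DRAM traffic.
    intensity = flops_per_elem / bytes_per_elem
    # Roofline: performance is capped by compute or by bandwidth,
    # whichever limit is hit first.
    return min(PEAK_GFLOPS, PEAK_GBYTES * intensity)

# Example: SAXPY (y = a*x + y) does 2 FLOPs per element while moving
# 12 bytes (read x, read y, write y; 4 bytes each), so its intensity
# is 1/6 and it is hopelessly bandwidth-bound on these numbers.
print(attainable_gflops(2.0, 12.0))  # 25.0 GFLOPS, far below peak
```

If the attainable figure sits on the bandwidth slope like this, measuring achieved GB/s tells you more about the kernel than measuring GFLOPS.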