I am trying to benchmark some of my programs, but I am wondering how best to determine how many floating point operations a particular thread is doing.
Example:
Should I count int threadNum = blockIdx.x * blockDim.x + threadIdx.x as two operations? Or does the assignment threadNum = Result itself also count as a FLOP?

Well, to be pedantic, that’s actually two integer operations, which generally go slower in CUDA. (No multiply-add instruction, and integer multiplication takes multiple clock cycles on pre-Fermi devices unless you use special functions to force 24-bit operation.)
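For reference, the "special functions" for 24-bit operation are the __mul24/__umul24 intrinsics. A minimal sketch of using one in index math (kernel and variable names here are just illustrative):

```cuda
// Sketch: 24-bit multiply intrinsic for index math on pre-Fermi
// hardware. __umul24 multiplies the low 24 bits of its operands,
// which maps to the fast multiplier on those devices.
__global__ void scale(float *out, const float *in, float factor)
{
    // Equivalent to blockIdx.x * blockDim.x + threadIdx.x, but using
    // the 24-bit path. Safe as long as gridDim.x * blockDim.x fits
    // in 24 bits.
    unsigned int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    out[i] = in[i] * factor;
}
```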

That aside, FLOPS is a semi-bogus metric. Everyone agrees to count an addition as one operation and a multiplication as one operation, so the throughput of the multiply-add instruction (or the fused multiply-add on Fermi) is used to compute the theoretical FLOPS of CUDA devices, since it gives you a factor-of-two boost. Division, however, is slower on most architectures, and transcendental functions are completely off the charts. You can figure out FLOPS pretty easily for simple linear algebra tasks, but in most other cases it can be very ambiguous.

Assignment is not a floating point operation, but it might require an instruction (or not, if the compiler can optimize it away). However, an assignment that turns into a global read or write becomes an instruction that takes many, many clocks, so instruction throughput might not be predictive either.
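A sketch of the distinction (names are illustrative): two of the assignments below are essentially free or optimized away, while the global accesses dominate the cost.

```cuda
// Sketch: assignments with very different costs.
__global__ void copy(float *dst, const float *src)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float tmp = src[i];  // global read: hundreds of clocks of latency
    float x   = tmp;     // register-to-register copy; the compiler
                         // will usually optimize this away entirely
    dst[i] = x;          // global write: again a long-latency access
}
```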

What is your goal? Maybe we can suggest a better measure.

The short of it is, I am trying to write code optimized to the point that I get close to the 1 TFLOPS my setup can hypothetically achieve. I was hoping it would be as straightforward as counting every +, -, * or / as a single operation, but sadly that doesn't seem to be the case.

Also, if ints are non-ideal to handle, would it be more logical to write something like a for loop with floats?

Actually, integer multiply-add exists in the PTX instruction set (it is quite common in index calculations), so there is no reason to switch for loops to float variables.
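For example, an ordinary integer loop counter and index calculation is fine. In a sketch like the one below, the index expression typically compiles to a single integer multiply-add (mad.lo) in PTX, which you can verify with nvcc --ptx:

```cuda
// Sketch: plain integer index math. blockIdx.x * blockDim.x +
// threadIdx.x typically becomes one mad.lo PTX instruction, so
// there is no need for float loop counters.
__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];  // one floating point multiply-add
}
```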

Do you happen to know if that is compiled to a single hardware instruction, and what the throughput is? I could see this being fast in the 24-bit case, but if a 32-bit multiply takes 4 clocks on pre-Fermi, then I would expect integer MAD to be slow as well.