# Benchmarking a program: what is the best option for finding the FLOPs for a given thread?

I am trying to benchmark some of my programs, but I am wondering how best to determine how many floating-point operations a particular thread is doing.
Example:

Any tips for figuring these out?

Well, to be pedantic, that’s actually two integer operations, which generally go slower in CUDA. (No multiply-add instruction, and integer multiplication takes multiple clock cycles on pre-Fermi devices unless you use special functions to force 24-bit operation.)

That aside, FLOPS are a semi-bogus metric. To be sure, everyone counts an addition as one operation and a multiplication as one operation. Thus, the throughput of the multiply-add instruction (or the fused multiply-add on Fermi) is used to compute the theoretical FLOPS for CUDA devices, since it gives you a factor-of-two boost. Division, however, is slower on most architectures, and transcendental functions are completely off the charts. You can figure out FLOPS pretty easily for simple linear algebra tasks, but in most other cases it can be very ambiguous.

Assignment is not a floating point operation, but it might require an instruction (or not, if the compiler can optimize it away). However, an assignment that turns into a global read or write becomes an instruction that takes many, many clocks, so instruction throughput might not be predictive either.

What is your goal? Maybe we can suggest a better measure.

Thanks for the in-depth reply. Very informative.

The short of it is, I am trying to write code optimized to the point that I get close to the 1 TFLOPS that my set-up can hypothetically achieve. I was hoping it would be as straightforward as every +, -, * or / counting as a single operation, but sadly that doesn't seem to be the case.

Also, if ints are non-ideal to handle, would it be more logical to write something like a for loop with float counters?

Actually, integer multiply-add exists in the PTX instruction set (it is quite common in index calculations), so there is no reason to switch for loops to float variables.

Determining the theoretical maximum speed requires some not-very-well-documented knowledge about the GPUs. Some good references (apart from the info in the Programming Guide) are "Benchmarking GPUs to Tune Dense Linear Algebra" and "Demystifying GPU Microarchitecture through Microbenchmarking". If you could post the inner loop of your kernel (and the device you are executing it on), we can probably help you a bit with that.

Do you happen to know if that is compiled to a single hardware instruction and what the throughput is? I could see this being fast perhaps in the 24-bit case, but if 32-bit multiply takes 4 clocks on pre-Fermi, then I would expect integer MAD is slow as well.

There are mad and mad24 instructions; decuda and nv50dis/nvc0dis show them as well (nvc0dis, of course, shows no mad24, since that instruction is gone on Fermi).
