Benchmarking a program: what is the best option for finding the FLOPs for a given thread?

I am trying to benchmark some of my programs, but I am wondering how best to determine how many floating-point operations a particular thread is doing.
Example:
Should I count int threadNum = blockIdx.x * blockDim.x + threadIdx.x as two? Or does the assignment threadNum = Result also count as a FLOP?

Any tips for figuring these out?

Well, to be pedantic, that’s actually two integer operations, which generally go slower in CUDA. (No multiply-add instruction, and integer multiplication takes multiple clock cycles on pre-Fermi devices unless you use special functions to force 24-bit operation.)

That aside, FLOPS are a semi-bogus metric. To be sure, everyone counts an addition as one operation and a multiplication as one operation. Thus, the throughput of the multiply-add instruction (or the fused multiply-add in Fermi) is used to compute the theoretical FLOPS for CUDA devices, as it gives you a factor-of-two boost. Division, however, is slower on most architectures, and transcendental functions are completely off the charts. You can figure out FLOPS pretty easily for simple linear algebra tasks, but in most other cases it can be very ambiguous.
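To make the counting concrete, here is a minimal sketch (the kernel below is made up, not from your code): an AXPY-style operation where each element does one multiply and one add, i.e. two FLOPs that the hardware can issue as a single multiply-add.

    // Hypothetical example: 1 multiply + 1 add = 2 FLOPs per element,
    // which the hardware can issue as one (fused) multiply-add.
    __global__ void axpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];   // counted as 2 FLOPs by convention
    }
    // Total work: 2 * n FLOPs; divide by the measured runtime to get FLOP/s.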

Assignment is not a floating point operation, but it might require an instruction (or not, if the compiler can optimize it away). However, an assignment that turns into a global read or write becomes an instruction that takes many, many clocks, so instruction throughput might not be predictive either.
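To illustrate the difference (just a sketch, with a throwaway kernel): a register-to-register assignment is likely to disappear after optimization, while an assignment that touches global memory turns into a long-latency load or store.

    __global__ void copyDemo(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float tmp = in[i];   // global load: hundreds of clocks of latency
        float t2  = tmp;     // register copy: almost certainly optimized away
        out[i]    = t2;      // global store: another long-latency memory instruction
    }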

What is your goal? Maybe we can suggest a better measure.

Thanks for the in-depth reply. Very informative.

The short of it is that I am trying to optimize my code to the point where I get close to the 1 TFLOPS that my setup can theoretically achieve. I was hoping it would be as straightforward as every +, -, * or / counting as a single operation, but sadly that doesn’t seem to be the case.
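(For illustration, a rough way to measure the achieved rate; the kernel name and FLOP count below are placeholders, not my actual code:)

    #include <cstdio>

    // flopsPerLaunch = however many FLOPs you count for one kernel launch.
    void reportGflops(dim3 blocks, dim3 threads, double flopsPerLaunch)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        myKernel<<<blocks, threads>>>();          // hypothetical kernel
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
        printf("%.1f GFLOP/s\n", flopsPerLaunch / (ms * 1.0e6));
    }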

Also, if ints are not handled efficiently, would it be more logical to write something like a for loop with float counters?

Actually, integer multiply-add exists in the PTX instruction set (it is quite common in index calculations), so there is no reason to switch for loops to float variables.

Determining the theoretical maximum speed requires some not-very-well-documented knowledge about the GPUs. Some good references (apart from the info in the Programming Guide) are “Benchmarking GPUs to Tune Dense Linear Algebra” and “Demystifying GPU Microarchitecture through Microbenchmarking”. If you could post the inner loop of your kernel (and the device you are executing it on), we can probably help you a bit with that.
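If it helps, here is a rough sketch of estimating the theoretical single-precision peak from the device properties (the cores-per-multiprocessor value depends on the architecture, so the number below is an assumption you would adjust for your card):

    #include <cstdio>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // Cores (SPs) per multiprocessor depend on the architecture:
        // 8 on G80/GT200-class parts, 32 on Fermi GF100. Adjust as needed.
        int coresPerSM = 8;

        // clockRate is in kHz; the factor 2 counts a multiply-add as two FLOPs.
        double peakGflops = 2.0 * prop.multiProcessorCount * coresPerSM
                          * (prop.clockRate * 1.0e3) / 1.0e9;

        printf("Theoretical peak: %.1f GFLOP/s\n", peakGflops);
        return 0;
    }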

Do you happen to know if that is compiled to a single hardware instruction, and what the throughput is? I could see this being fast perhaps in the 24-bit case, but if a 32-bit multiply takes 4 clocks on pre-Fermi, then I would expect integer MAD to be slow as well.

There are mad and mad24 instructions. decuda and nv50dis/nvc0dis show them as well (nvc0dis, of course, shows no mad24).
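If you want to check on your own code, a quick sketch (throwaway kernel, hypothetical file name): compile with nvcc -ptx and look for mad.lo.s32 / mad24.lo.s32 in the output.

    // Compile with e.g. "nvcc -ptx idx.cu" and inspect the generated PTX,
    // or feed the cubin to decuda to see the actual hardware instructions.
    __global__ void indexDemo(float *out)
    {
        int i32 = blockIdx.x * blockDim.x + threadIdx.x;             // 32-bit mul + add
        int i24 = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;    // 24-bit variant
        out[i32] = (float)i24;   // store so the compiler keeps both computations
    }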
