Integer performance vs floating point

Biaowang · March 27, 2014, 9:28pm

Hi, all:

I benchmarked peak performance with OpenCLBench on a GT 430 and observed that floating-point throughput approaches the theoretical peak, while the integer throughput is only about half of that. I suspect that FMA is applied only to floating point, not to integer.
Could anyone confirm this and explain why NVIDIA disables FMA for integer?

Best

njuffa · March 27, 2014, 10:03pm

FMA (fused multiply-add) is a floating-point operation. On NVIDIA GPUs with compute capability 2.0 or higher there are single-precision and double-precision instruction versions (FFMA and DFMA) of this operation. As this is a floating-point operation there is no connection with integer performance.

Uncle_Joe · March 28, 2014, 2:14am

Right. There is no such thing as a “fused” integer multiply-add because we don’t round the multiplier’s output, hence it doesn’t need that distinction.
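To make the rounding point concrete, here is a minimal host-side C++ sketch (not GPU code). It only assumes IEEE-754 single precision and the standard `std::fma`; the function names are made up for illustration. The float case shows the fused and unfused results differing because the product is rounded before the add, while the integer case shows that computing exactly and then truncating gives the same 32-bit result, so there is nothing to "fuse".

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Fused: a*b is kept exact and the sum is rounded once.
float fused_mad(float a, float b, float c) {
    return std::fma(a, b, c);
}

// Unfused: the product is rounded to float before the add.
// The volatile store keeps the compiler from contracting this into an FMA.
float unfused_mad(float a, float b, float c) {
    volatile float product = a * b;
    return product + c;
}

// Integer multiply-add in 32-bit arithmetic (wraps modulo 2^32).
uint32_t int_mad(uint32_t a, uint32_t b, uint32_t c) {
    return a * b + c;
}

// The same value derived from an exact 64-bit intermediate, then truncated.
uint32_t int_mad_exact_then_truncate(uint32_t a, uint32_t b, uint32_t c) {
    uint64_t exact = static_cast<uint64_t>(a) * b + c;
    return static_cast<uint32_t>(exact);
}

// Worked case: a = 1 + 2^-12, so a*a = 1 + 2^-11 + 2^-24 exactly.
// Rounding the product to float drops the 2^-24 term; the fused form keeps it.
void check_rounding_example() {
    const float a = 1.0f + std::ldexp(1.0f, -12);
    const float c = -(1.0f + std::ldexp(1.0f, -11));
    assert(fused_mad(a, a, c) == std::ldexp(1.0f, -24));
    assert(unfused_mad(a, a, c) == 0.0f);
    assert(int_mad(0xFFFFFFFFu, 0xFFFFFFFFu, 7u) ==
           int_mad_exact_then_truncate(0xFFFFFFFFu, 0xFFFFFFFFu, 7u));
}
```

Calling `check_rounding_example()` from any `main` exercises all three cases.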

However, there are Kepler instructions that do an integer multiply-add in one instruction. There are fewer integer multipliers than floating-point multipliers, though, because a 32-bit integer multiplier would cost about 1.8x = (32/24)^2 the hardware of the 24-bit integer multiplier used in a float unit. The actual difference is probably smaller, since the integer multiplier doesn't have to do floating-point exponent addition and significand rounding.

cbuchner1 · March 28, 2014, 10:47am

Where would I find more about the Kepler integer multiply-add? Are these exposed through intrinsics? I do a lot of crypto stuff as a hobby (cudaminer, ccminer) and these instructions might be of use.

Gert-Jan · March 28, 2014, 12:36pm

According to the instruction set reference, integer multiply-add (IMAD) has been part of the instruction set since the beginning (compute capability 1.x). The performance of multiply-add is the same as multiply. When you disassemble a kernel (cuobjdump -sass) you will see a lot of IMAD instructions. Apparently the compiler is already smart enough to schedule these IMAD instructions, so no need for intrinsics here.

For those interested, a list of integer intrinsics can be found here. I’m not sure how all these intrinsics map to actual instructions; some, like __sad(), map to a single instruction.

For more integer performance there are also the SIMD Video Instructions which are described best in the PTX documentation. However, according to SPWorley’s forum post these instructions are no longer natively supported in Maxwell.

Uncle_Joe · March 28, 2014, 12:40pm

You can just disassemble the cubin files to see the IMAD instructions. There are 3 variants: IMAD.LO (compute lower 32 bits of result), IMAD.HI, and IMAD.WIDE (compute full 64 bit result). I’ve seen IMAD.WIDE used frequently for address calculations (index * scale + base) within inner loops. Code compiled for a CPU would almost always convert expressions like a[i] to an induction variable (pointer that increments every iteration), but maybe on the GPU, they think saving a register is worth a multiply-add?
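As a rough illustration of the three variants, here is a host-side C++ sketch for the unsigned case. It is modelled on the PTX `mad.lo` / `mad.hi` / `mad.wide` semantics and assumes the SASS IMAD variants behave the same way; the function names are hypothetical. It also checks that recomputing index * scale + base every iteration produces the same addresses as an incrementing pointer (induction variable).

```cpp
#include <cassert>
#include <cstdint>

// Low 32 bits of a*b, plus c (wraps modulo 2^32).
constexpr uint32_t mad_lo(uint32_t a, uint32_t b, uint32_t c) {
    return a * b + c;
}

// High 32 bits of the 64-bit product a*b, plus c.
constexpr uint32_t mad_hi(uint32_t a, uint32_t b, uint32_t c) {
    return static_cast<uint32_t>((static_cast<uint64_t>(a) * b) >> 32) + c;
}

// Full 64-bit product of two 32-bit operands, plus a 64-bit addend.
constexpr uint64_t mad_wide(uint32_t a, uint32_t b, uint64_t c) {
    return static_cast<uint64_t>(a) * b + c;
}

// Address recomputed from scratch: index * scale + base.
constexpr uint64_t element_address(uint64_t base, uint32_t index, uint32_t scale) {
    return mad_wide(index, scale, base);
}

// Compares the recomputed address with an induction variable that is
// incremented by scale each iteration; true if they agree for n elements.
constexpr bool same_addresses(uint64_t base, uint32_t scale, uint32_t n) {
    uint64_t p = base;  // induction variable
    for (uint32_t i = 0; i < n; ++i) {
        if (element_address(base, i, scale) != p) return false;
        p += scale;
    }
    return true;
}

// 0x10000 * 0x10000 = 2^32: low half is 0, high half is 1.
static_assert(mad_lo(0x10000u, 0x10000u, 5u) == 5u, "low half plus addend");
static_assert(mad_hi(0x10000u, 0x10000u, 5u) == 6u, "high half plus addend");
static_assert(mad_wide(0x10000u, 0x10000u, 5u) == 0x100000005ull, "full result");
static_assert(same_addresses(0x1000u, 4u, 16u), "both strategies agree");
```

The multiply-add form needs only the base and the index live in registers, while the induction-variable form carries an extra pointer across iterations, which is the trade-off being guessed at above.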
