Integer performance vs floating point

Hi, all:

I benchmarked peak performance with OpenCLBench and observed that floating-point throughput approaches the theoretical peak of the GT 430, but the integer GFLOPS figure is only about half. I therefore suspect that the FMA capability applies only to floating point, not to integers.
Could anyone confirm this and explain why NVIDIA disables FMA for integers?


FMA (fused multiply-add) is a floating-point operation. On NVIDIA GPUs with compute capability 2.0 or higher there are single-precision and double-precision instruction versions of this operation (FFMA and DFMA). Since FMA is a floating-point operation, it has no bearing on integer performance.

Right. There is no such thing as a “fused” integer multiply-add, because an integer multiply-add never rounds the multiplier’s output, so the “fused” distinction doesn’t apply.

That said, Kepler does have instructions that perform an integer multiply-add in a single instruction. But there are fewer integer multipliers than floating-point multipliers, because a 32-bit integer multiplier would cost roughly (32/24)^2 ≈ 1.8x more hardware than the 24-bit multiplier used in a float unit. The actual difference is probably smaller, since the integer multiplier doesn’t have to do floating-point exponent addition or significand rounding.

Where would I find more about the Kepler integer multiply-add? Is it exposed through intrinsics? I do a lot of crypto stuff as a hobby (cudaminer, ccminer), and these instructions might be of use.

According to the instruction set reference, integer multiply-add (IMAD) has been part of the instruction set since the beginning (compute capability 1.x). Multiply-add runs at the same rate as a plain multiply. When you disassemble a kernel (cuobjdump -sass) you will see a lot of IMAD instructions. The compiler is apparently already smart enough to emit and schedule IMAD on its own, so no intrinsics are needed here.
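As a minimal sketch (hypothetical kernel and file names), a plain integer multiply-add expression like the one below is normally emitted as a single IMAD when compiled for sm_2x/sm_3x; you can confirm this yourself with cuobjdump -sass:

```cuda
// Compile with e.g. `nvcc -arch=sm_30 -cubin imad.cu`, then inspect
// the SASS with `cuobjdump -sass imad.cubin`.
__global__ void imad_kernel(const int *a, const int *b, int *c, int n)
{
    // The global index computation is itself a multiply-add.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] * b[i] + c[i];  // typically one IMAD instruction
}
```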

For those interested, a list of integer intrinsics can be found here. I’m not sure how all of these intrinsics map to actual instructions, but some, like __sad(), map to a single instruction.
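For example (a sketch with hypothetical kernel and parameter names), __sad(x, y, z) computes |x - y| + z, which is handy for accumulating sum-of-absolute-differences:

```cuda
// Accumulate |a[i] - b[i]| into out[i] using the __sad() intrinsic,
// which computes |x - y| + z in a single instruction on most GPUs.
__global__ void sad_kernel(const int *a, const int *b,
                           unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sad(a[i], b[i], out[i]);
}
```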

For more integer performance there are also the SIMD Video Instructions which are described best in the PTX documentation. However, according to SPWorley’s forum post these instructions are no longer natively supported in Maxwell.
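These are also exposed as device intrinsics; a sketch (hypothetical kernel name) using __vadd4(), which performs four independent 8-bit additions within one 32-bit word:

```cuda
// Per-byte addition of packed 8-bit lanes. On Kepler this maps to a
// native SIMD video instruction; on Maxwell it is emulated with
// several ordinary instructions, as noted above.
__global__ void vadd4_kernel(const unsigned int *a, const unsigned int *b,
                             unsigned int *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = __vadd4(a[i], b[i]);  // four 8-bit adds per 32-bit word
}
```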

You can just disassemble the cubin files to see the IMAD instructions. There are three variants: IMAD.LO (compute the lower 32 bits of the result), IMAD.HI (the upper 32 bits), and IMAD.WIDE (the full 64-bit result). I’ve seen IMAD.WIDE used frequently for address calculations (index * scale + base) within inner loops. Code compiled for a CPU would almost always convert an expression like a[i] into an induction variable (a pointer that increments every iteration), but maybe on the GPU they think saving a register is worth a multiply-add?
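The two addressing styles can be sketched like this (hypothetical kernels; the exact SASS depends on compiler version and architecture). The array-index form typically compiles its 64-bit address computation to IMAD.WIDE, while the induction-variable form avoids the multiply but keeps a 64-bit pointer live across the loop:

```cuda
// Array-index form: the address a + i*stride*sizeof(float) is usually
// computed each iteration with a single IMAD.WIDE.
__global__ void index_form(const float *a, float *b, int n, int stride)
{
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        b[i] = a[i * stride];
}

// Induction-variable form: no per-iteration multiply, but the pointer p
// occupies an extra register pair for the whole loop.
__global__ void pointer_form(const float *a, float *b, int n, int stride)
{
    const float *p = a + threadIdx.x * stride;
    for (int i = threadIdx.x; i < n; i += blockDim.x, p += blockDim.x * stride)
        b[i] = *p;
}
```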