CUDA 3.5 Integer Multiply Performance: Is it really 3x slower than 64-bit floating point?

Hi njuffa,
thanks. I will think about it…
I thought it was possible to multiply integers of at most half the mantissa width (12-13 bits) without losing any digits to rounding.
Am I missing something?

In general, a floating-point multiplication returns the N top-most bits of a product (modulo any rounding), while an integer multiply returns the N bottom-most bits.

The “trick” in the above sequence is that FMA computes the full, unrounded product internally. By subtracting out the previously computed high bits, we can thus retrieve the low bits. So one could handle up to a 24x24-bit product. However, that would leave no bits for accumulating the partial products, which is why I proposed 16-bit operands.

As I said, the above is a code sketch demonstrating the idea, not fully tested code.