CUDA "fmsub" performance against negation+fma

Hi, this is my first question in the forum!

I ran into this while testing the fma (fused multiply-add) operation. I have this sample code.

__device__ __constant__ int N;

__global__ void my_kernel(float4 a, int *out)
{
    const int res = int(a.x + threadIdx.x*float(-N));

    out[threadIdx.x] = res;
}

int main() { }

This code produces an fma instruction (I use cuobjdump -ptx <exec_file> to see that). The cast to float is needed in order to achieve that. The thing is that if I change the core line to:

const int res = int(a.x - threadIdx.x*N);

then the same output should be returned, but it won’t produce an fma.

I have two questions.

  1. Why is the cast to float needed to produce an fma?

  2. Is it really worth spending an extra operation, i.e. addition plus negation instead of subtraction? I assume the performance should be the same, because whatever time fma saves is spent again on the negation instruction. Maybe CUDA-capable graphics cards have hardware that makes negation a free operation, but I was not able to find information about it. Any ideas?

The fma operation is a feature of the floating point ALU.

The hardware’s Integer ALUs are separate and not capable of fma.

Fun trivia: The earliest CUDA 1.x devices used parts of the floating-point ALUs to perform integer multiplication. This is why only 24-bit integer multiplication was supported in hardware (the single-precision floating-point mantissa has 24 bits), and 32-bit integer multiplications required several instructions to perform.
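For the curious, that 24-bit path is still exposed through the __mul24() intrinsic. A minimal sketch (kernel and buffer names are mine, purely for illustration):

__global__ void mul24_demo(const int *a, const int *b, int *out)
{
    int i = threadIdx.x;
    // __mul24 multiplies the least significant 24 bits of each operand.
    // On CUDA 1.x hardware this was a single instruction, while a full
    // 32-bit multiply took an instruction sequence; on newer
    // architectures the regular 32-bit multiply is the fast path.
    out[i] = __mul24(a[i], b[i]);
}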

Ok, first question solved then. Yes, fma is defined as fma(float, float, float). I guess only one explicit cast to float is needed, and the rest of the operands are converted automatically to perform the multiplication and the addition.

How about the second one? Should I use fma or leave it as integer multiplication+subtraction?

Such questions are usually difficult to give absolute answers to. It depends on the exact code generated, which will depend on the exact CUDA version and the architecture compiled for; it may depend on the GPU (different GPUs have different ratios of floating-point to integer throughput) and may even depend on what else is going on in your code. If your code is pure integer code, finding unusual ways to shift operations to the floating-point units may speed it up by relieving pressure on the integer units.

Try benchmarking it.
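For instance, a minimal CUDA-event timing harness along these lines would let you compare the two variants directly. This is only a sketch: the sizes and iteration counts are arbitrary, and at one block of 256 threads launch overhead will dominate, so scale the work up for a meaningful comparison.

#include <cstdio>
#include <cuda_runtime.h>

__device__ __constant__ int N;

// Variant that the compiler can contract into FMA.
__global__ void kernel_fma(float4 a, int *out)
{
    out[threadIdx.x] = int(a.x + threadIdx.x * float(-N));
}

// Variant that performs an integer multiply first.
__global__ void kernel_int(float4 a, int *out)
{
    out[threadIdx.x] = int(a.x - threadIdx.x * N);
}

int main()
{
    const int threads = 256, iters = 10000;
    int *out;
    cudaMalloc(&out, threads * sizeof(int));
    int n = 3;
    cudaMemcpyToSymbol(N, &n, sizeof(int));
    float4 a = make_float4(1.0f, 0.0f, 0.0f, 0.0f);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    kernel_fma<<<1, threads>>>(a, out);   // warm-up launch

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) kernel_fma<<<1, threads>>>(a, out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms_fma;
    cudaEventElapsedTime(&ms_fma, start, stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) kernel_int<<<1, threads>>>(a, out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms_int;
    cudaEventElapsedTime(&ms_int, start, stop);

    printf("fma variant: %.3f ms, int variant: %.3f ms\n", ms_fma, ms_int);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    return 0;
}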

The GPU instruction FMA d,a,b,c (i.e. d = a*b + c) allows negation operators to be applied to a*b as well as c at no cost, so the issue here is not the lack of a dedicated FMSUB operation. Your observation is easily explained by applying C/C++ operator precedence and type promotion rules:

(a) multiplication has precedence over addition/subtraction
(b) in a two-input mathematical operation where one operand is ‘int’ and the other ‘float’, the ‘int’ operand is converted to ‘float’ prior to the operation

case 1: a.x + threadIdx.x*float(-N)

Based on (a), threadIdx.x*float(-N) is evaluated first. Based on (b) this is evaluated as if you had written float(threadIdx.x)*float(-N), i.e. a floating-point multiplication is performed. Next the ‘float’ product is added to a.x, which is ‘float’, so a floating-point addition is performed. During optimization the compiler recognizes that a floating-point multiplication followed by a dependent floating-point addition can be contracted into FMA.
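(As an aside, if you want the fused operation regardless of what the optimizer does, you can request it explicitly with the standard fmaf() function. A sketch, using the names from the kernel above:)

// fmaf(x, y, z) computes x*y + z as a single fused operation.
const int res = int(fmaf(float(threadIdx.x), float(-N), a.x));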

case 2: a.x - threadIdx.x*N

Based on (a), threadIdx.x*N is evaluated first. Since both operands are of type ‘int’, an integer multiplication is performed. Based on (b), the ‘int’ product is then converted to ‘float’, and the result enters into a floating-point subtraction, as if you had written a.x - float(threadIdx.x*N). So here the dependency chain is: ‘int’ multiplication, ‘int’-to-‘float’ conversion, ‘float’ subtraction. No contraction exists that can be applied to this sequence.
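(If you prefer to keep the subtraction but still give the compiler a chance to contract, you can force the multiplication into floating point yourself. A sketch; whether contraction actually happens still depends on the compiler version and on flags such as nvcc’s --fmad:)

// Cast before the multiply so both the multiplication and the
// subtraction happen in floating point; as noted above, FMA can
// absorb the negation of the product at no cost.
const int res = int(a.x - float(threadIdx.x) * float(N));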