It would be helpful to have a special pragma or something where the enclosed code would have the zero fp32 multiplication optimization enabled. Something like:

// Get the value at x of a polynomial known at compile time
__zero_optimize__
__device__ float eval_poly_at_x(float x)
{
    // These coefficients can be determined at compile time from a different part of the code
    constexpr float coefficient[]{ 0.0f, 0.0f, 0.0f, 0.0f };
    constexpr int size = sizeof(coefficient) / sizeof(*coefficient);
    float value{};
    for (int i = size - 1; i >= 0; i--)
        value = coefficient[i] + x * value;
    return value;
}

It would only optimize that function. It’s helpful for values that we know for sure will not be NaN or inf and we can just use that optimization without breaking the code.

After studying your code, I have no idea what the "zero fp32 multiplication optimization" is. Does it apply when x is zero? When value is zero? When coefficient[i] is zero? What should happen in each of those cases? A run-time check and a change in run-time behavior? Some change in compile-time behavior?

Regarding feature requests, use the bug reporting system. Identifying it as a feature request just as you have done here is sufficient to identify it as a feature request, not a bug.

In my example all coefficients are zero so the compiler should just optimize the entire function eval_poly_at_x to return 0.0f every time, after optimizing away all multiplications with zero (removing them completely).
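To make the expectation concrete, here is a minimal host-side sketch of the same function (the `__device__` qualifier and the proposed `__zero_optimize__` marker are dropped so it builds with an ordinary C++ compiler, and `zero_poly_is_nan_at` is a hypothetical helper added for illustration). It assumes default IEEE 754 semantics, i.e. no fast-math style flags. Under those rules the all-zero polynomial returns 0 for finite x, but not for infinite or NaN x, which is why a conforming compiler cannot simply replace the body with `return 0.0f`.

```cpp
#include <cmath>
#include <limits>

// Host-side copy of the function from the original post, evaluated with
// default IEEE 754 semantics (assumption: no fast-math flags are in effect).
float eval_poly_at_x(float x)
{
    constexpr float coefficient[]{ 0.0f, 0.0f, 0.0f, 0.0f };
    constexpr int size = sizeof(coefficient) / sizeof(*coefficient);
    float value{};
    for (int i = size - 1; i >= 0; i--)
        value = coefficient[i] + x * value;  // Horner step
    return value;
}

// Hypothetical helper: true when the all-zero polynomial evaluates to NaN at x.
bool zero_poly_is_nan_at(float x)
{
    return std::isnan(eval_poly_at_x(x));
}
```

For finite x every Horner step computes 0 + x * 0, which is zero. For x equal to infinity or NaN the first step already produces NaN, and NaN propagates through the remaining steps.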

Am I supposed to use the bug reporting system or not? I am very confused. I’m going to use the link to request the feature anyway.

(1) Introduce a new mechanism that applies specific compiler optimizations with function scope only, by introducing a new function attribute. I am reasonably sure that CUDA does not currently support per-function optimization settings. It seems to me that whatever is specified would have to be passed through all the way to the linker, due to the existing support for link-time optimization.

(2) Introduce a new separate optimization mode in which floating-point multiplication x*0 always returns 0 and floating-point addition x+0 always returns x, regardless of x (the original post mentions multiplication only, but since coefficients are being added in the example given, I don’t think it would fold as desired unless both of the above requirements are met).
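Without such a mode, the two rules in (2) can be applied by hand when the coefficients are compile-time constants. The following is only a sketch under stated assumptions: it needs C++17 for `if constexpr`, it is written as host code (whether it ports unchanged to `__device__` code is not verified here), and the names `all_zero_from`, `horner_from`, `eval_poly_folded` and the sample coefficients are hypothetical. Because it deliberately drops the x*0 and x+0 operations, its results can differ from strict IEEE 754 evaluation when x is NaN, infinite, or a signed zero.

```cpp
#include <cstddef>

// Sample coefficients (hypothetical): p(x) = 1 + 2*x^3.
constexpr float coefficient[]{ 1.0f, 0.0f, 0.0f, 2.0f };
constexpr std::size_t size = sizeof(coefficient) / sizeof(*coefficient);

// True if coefficient[first .. size-1] are all zero (also true for first == size).
constexpr bool all_zero_from(std::size_t first)
{
    for (std::size_t k = first; k < size; ++k)
        if (coefficient[k] != 0.0f) return false;
    return true;
}

// Evaluates c[I] + x * (c[I+1] + x * (...)) with zero terms removed at compile time.
template <std::size_t I>
float horner_from(float x)
{
    if constexpr (all_zero_from(I)) {
        return 0.0f;                                   // whole tail is zero
    } else if constexpr (coefficient[I] == 0.0f) {
        return x * horner_from<I + 1>(x);              // rule "0 + y -> y"
    } else if constexpr (all_zero_from(I + 1)) {
        return coefficient[I];                         // rule "x * 0 -> 0"
    } else {
        return coefficient[I] + x * horner_from<I + 1>(x);
    }
}

float eval_poly_folded(float x)
{
    return horner_from<0>(x);
}
```

With the sample coefficients this reduces to 1 + x * (x * (x * 2)), and with all-zero coefficients it reduces to a plain `return 0.0f`, which is the folding the original post asks the compiler to perform.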

In order to be viable, an enhancement request of this nature will probably require (1) plausible supporting arguments that these new features would have more than niche applicability, e.g. by enumerating known use cases, and (2) performance data showing that adding these features will result in non-trivial (e.g. > 5%) performance gains at application level.

Very good point about x+0=x always. Forgot about that one. Normally you can’t assume x+0=x and x*0=0 if you want the IEEE 754 standard to be upheld. GCC has flags (-ffinite-math-only and -fno-signed-zeros, both implied by -ffast-math) that relax the standard enough to allow those two simplifications. I was hoping we would at least get that at a global level for nvcc as well. Btw, nvcc already does these two optimizations for integer values :)

Two entirely different kettles of fish. Integers don’t have signed zero, infinity, NaN.
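A short host-side sketch of the difference, assuming default IEEE 754 semantics with round-to-nearest and no fast-math flags (the helper names `times_zero`, `plus_zero` and `times_zero_int` are hypothetical, chosen for illustration):

```cpp
#include <cmath>
#include <limits>

// Floating-point versions of the two operations under discussion.
float times_zero(float x) { return x * 0.0f; }
float plus_zero(float x)  { return x + 0.0f; }

// Integer counterpart: there is only one zero and no infinity or NaN,
// so x * 0 can always be folded to 0.
int times_zero_int(int x) { return x * 0; }

// Cases where folding the floating-point operations would change the result:
//   times_zero(-1.0f)     is -0.0f (sign bit set), not +0.0f
//   times_zero(infinity)  is NaN
//   times_zero(NaN)       is NaN
//   plus_zero(-0.0f)      is +0.0f, so x + 0 does not return x unchanged
```

The values -0.0f and +0.0f compare equal, so the difference only becomes visible through `std::signbit` or through later operations such as division.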

In order to preserve the sanity of programmers writing floating-point code, CUDA has historically tried to stick closely to the semantics prescribed by IEEE-754 and its C99 bindings (inherited by C++11). The one exception is FMA merging, which is, however, under programmer control.