floating point model pragmas


Does nvcc have #pragmas that control the floating-point model,
#pragma float_control or similar?

We use an if( A*X + B*Y + C > 0 ) construct in a couple of places
in our CUDA code (sm_20, Tesla C2070) and need to make sure
it produces IDENTICAL outcomes for given A, B, C, X, Y,
immune to the compiler reordering the floating-point
operations while trying to optimize the code.

The idea is to be able to use something along the lines of
#pragma float_control(precise, on, push)
to ensure that the compiler will not interpret the criterion
as, roughly speaking, (A*X + B*Y) + C in one place in the code
and A*X + (B*Y + C) in another. (We would obviously want to
keep the most aggressive optimization for the rest of the code.)

Here are a couple of links that discuss floating-point model pragmas:

http://stackoverflow.com/questions/3407493/floating-point-c-compiler-options-preventing-a-b-a-1-b
http://msdn.microsoft.com/en-US/library/45ec64h6(v=vs.80).aspx
http://www.nccs.nasa.gov/images/FloatingPoint_consistency.pdf

There are no such #pragmas in CUDA to my knowledge. There are intrinsic functions that let you explicitly control where addition, multiplication, and fused multiply-add are used, and with what rounding convention. See the second page of Section 5.4.1 in the CUDA programming guide for an explanation of functions like __fmul_r{n,z,u,d}, etc.
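As a sketch (assuming double precision; the function name and ordering below are just an illustration of the intrinsics, not code from the thread), the expression could be pinned down like this:

```cuda
// Force the evaluation order ((A*x) + (B*y)) + C in double precision.
// Each intrinsic rounds to nearest even and carries an explicit rounding
// modifier in the generated PTX, which the optimizer treats conservatively.
__device__ double f(double A, double B, double C, double x, double y)
{
    double ax  = __dmul_rn(A, x);    // A*x, rounded once
    double by  = __dmul_rn(B, y);    // B*y, rounded once
    double sum = __dadd_rn(ax, by);  // A*x + B*y
    return __dadd_rn(sum, C);        // (A*x + B*y) + C, no FMA contraction
}
```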

In CUDA 4.1, there is a flag in nvcc, -fmad=false, to disable the contraction of multiplies and adds into FMAs.
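For example (hypothetical file name):

```shell
# Compile for sm_20 with FMA contraction disabled (CUDA 4.1 or later)
nvcc -arch=sm_20 -fmad=false -o app app.cu
```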

I may not have been clear in my explanation of the (potential) problem we have.

Say I have a function
inline double f(double A, double B, double C, double x, double y) {
    return A*x + B*y + C;
}

and this function is used in two (or more) different places in the code.
The function is declared as “inline” and is short, so most likely
it will be inlined by the compiler (especially if -O3 is used).

I need a guarantee that the compiler will not compile it into something like
(A*x + B*y) + C in one place and into A*x + (B*y + C) in the other.
Or, for example, in one case the compiler might write “f”
out as a variable to a 64-bit memory location and in the other case
choose to keep it in a register, in which case the numbers
could be different (could they be?).

More generally speaking: how can I make 100% sure that the
result of the function call is ALWAYS the same, provided
the same A, B, C, x, y are fed into it, in any place in the
code?

The idea of using floating-point pragmas is to try to make the compiler
consistent in compiling this function/line of code, so that the
result is the same (for identical A, B, C, x, y, that is),
not to avoid using FMA instructions. I really do not care how the compiler
compiles the formula; I need it to be THE SAME when the function
is inlined in different parts of the code.

For all I know (or do not know), nvcc may already work like that. I
would like confirmation from NVIDIA compiler designers/engineers
or more knowledgeable members of the forum.

The idea of using the fmul and fadd functions rather than the * and + operators
appeals to me, as I could explicitly tell the compiler the order
of the operations I want. But can one guarantee that the compiler
will not reorder the operations to its liking anyway?

Is there an nvcc option that spits out the assembly
of the code along with the C source, like regular CPU C compilers do?
I could not seem to find the switch.

Assign partial results to temporary variables. The sequence points at the end of those statements force the compiler to always compile the expression in the same way:

inline double f(double A, double B, double C, double x, double y)
{
    double t = A*x + B*y;
    return t + C;
}

As seibert recommended, use the appropriate single-precision or double-precision intrinsics to enforce the sequence of computations you want. The CUDA math library uses this technique extensively since it is really a collection of header files with lots of inline functions, and it needs to return predictable results even though it is subject to every user selectable compiler switch. In particular it needs to guard against the effects of the -fmad={true|false} switch.

As a general note, the NVVM frontend is even better than the Open64 frontend in finding opportunities for merging FMUL plus FADD into FMAD or FMA. This is good for performance but may occasionally lead to numerical differences when code that was previously compiled with Open64 is now compiled with NVVM.

To dump the machine code generated by the compiler, run the cuobjdump application on the resulting binary.
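For example, assuming the compiled binary is named app:

```shell
cuobjdump -sass app   # disassemble the embedded machine code (SASS)
cuobjdump -ptx  app   # dump the embedded PTX
```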

thanks for the reply.

I do not understand how introducing a temporary variable can prevent the compiler
from optimizing it away (unless maybe the “volatile” qualifier is used).
I am under the impression that the compiler is at its discretion in handling
intermediate/temporary/local variables. Is it not?

Could the NVIDIA guys clarify this?

thanks for the reply.

Now I am seeing a light at the end of the tunnel, so to speak!

I would like to reiterate the question, now including the intrinsics:

If I explicitly use the fmul and fadd or fmad or fma intrinsics in a certain order,
the compiler will not even attempt to reorder the operations,
EVEN THOUGH -O3, -fmad=true, and all the other optimization switches are ON?
I.e., won’t the compiler ignore any specific order of intrinsics when it is
asked to apply heavy optimization?

Or, another question: could it happen that the operations are translated
into PTX in the desired order, BUT then optimized away, reordered, and replaced
when loaded onto the card?

Will the C source code be interleaved with the assembly?

If not, could you offer advice on how I can find where my A*X+B*Y+C is in the many pages of assembly code?

Thank you!

Unless the compiler can prove that no floating-point exceptions occur, it has to obey sequence points because of the inexact exceptions that might be raised.

Now, Nvidia GPUs don’t support floating-point exceptions, so this proof would be rather easy… but I assume that Nvidia hasn’t changed the relevant portions of the compiler, and that PTX doesn’t reorder operations either. Maybe these assumptions are too bold, though.

You can put instructions around the calculations that are not used anywhere else, like __prof_trigger() (which allows you to mark 16 different places). You would then of course need to do a second compile to check that those extra instructions haven’t changed the floating-point code…

In the case of PTX you can even insert comments like asm volatile("// interesting code section is here:"), although you still need to make sure this didn’t reorder the code.
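A sketch of that marker idea (hypothetical usage; counters 0 and 1 are arbitrary choices):

```cuda
__device__ double f(double A, double B, double C, double x, double y)
{
    __prof_trigger(0);         // marker: start of the interesting section
    double r = A*x + B*y + C;  // expression to locate in the PTX/SASS dump
    __prof_trigger(1);         // marker: end of the interesting section
    return r;
}
```

The __prof_trigger() calls compile to pmevent instructions in the PTX, which are easy to search for in the dump.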

thanks, tera, for your replies.

Could you answer my question regarding how I could generate the disassembly
annotated with the corresponding C lines, so I could locate the relevant portions
of the code?

Unfortunately I don’t think this is possible. Thus the workaround I suggested.

njuffa (or somebody else from the NVIDIA compiler group),
could you please answer my post #8 above? Thanks.

To avoid misunderstandings: while I have worked extensively with our compiler team for the past seven years, i.e. the duration of CUDA development, I am not a member of the compiler team.

The intrinsics mentioned earlier (__fadd_r{n,z,u,d}, __dadd_r{n,z,u,d}, etc.) translate into PTX operations with explicit rounding modifiers. The PTX manual specifies the following:

http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_3.0.pdf

An add instruction with an explicit rounding modifier is treated conservatively by the code
optimizer. An add instruction with no rounding modifier defaults to round-to-nearest-even and
may be optimized aggressively by the code optimizer. In particular, mul/add sequences with no
rounding modifiers may be optimized to use fused-multiply-add instructions on the target
device.

A mul instruction with an explicit rounding modifier is treated conservatively by the code optimizer.
A mul instruction with no rounding modifier defaults to round-to-nearest-even and may be
optimized aggressively by the code optimizer. In particular, mul/add and mul/sub sequences with
no rounding modifiers may be optimized to use fused-multiply-add instructions on the target
device.
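Compiled with the rounding-mode intrinsics, the expression should therefore come out as PTX with an explicit .rn modifier on every operation. A hand-written sketch of the expected shape (register numbers invented, not actual nvcc output):

```
mul.rn.f64  %fd5, %fd1, %fd3;   // A*x, explicit round-to-nearest
mul.rn.f64  %fd6, %fd2, %fd4;   // B*y
add.rn.f64  %fd7, %fd5, %fd6;   // A*x + B*y
add.rn.f64  %fd8, %fd7, %fd0;   // (A*x + B*y) + C
```

Per the rules quoted above, these instructions carry explicit rounding modifiers and are treated conservatively, i.e. they should not be contracted into fma.rn.f64.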