Code compiled with -g -G is different from that compiled with -O

Hi, I have run into a strange problem: when compiling one CUDA kernel,
the resulting code compiled with -g -G behaves differently from that compiled with -O.

nvcc -gencode arch=compute_20,code=sm_20 -g -G -o gpu_kernels.o -c --compiler-options="-Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -g -I/… -D__INSDIR__= " -------- this works, but

nvcc -gencode arch=compute_20,code=sm_20 -o gpu_kernels.o -c --compiler-options="-Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O -I/… -D__INSDIR__= " -------- this does not work

-G disables most (almost all) device code optimizations that the (device code) compiler might make. One of the principal reasons for this is so that sane mappings can be made between source code and device machine code, to facilitate source-level debug. (-G also generates additional symbols for debug, like -g does for host code).

I guess by “work/not work” you mean that the code with -G produces the expected result, and the code without it does not.

This happens from time to time, and it usually means one of two things:

  1. there is a logic error or other defect in your code (such as a race condition) that is being exposed when the compiler optimizes code, or
  2. a compiler bug

It may not be possible to tell which without a short, concise, compilable code sample that demonstrates the issue, along with the OS, CUDA version, driver version, and compile command line you are using, as well as the specific GPU you are running on.

In case “it doesn’t work” refers to numerical differences of some kind, further possibilities are:

(3) use of floating-point atomics leads to different operation order [floating-point arithmetic is, in general, not associative]

(4) contraction of FMUL followed by FADD to FMA (fused multiply-add) due to code optimization leads to a reduction in accumulated rounding error and possibly reduces the effects of subtractive cancelation. You can turn off these contractions with -fmad=false
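To illustrate point (3), here is a minimal sketch of why summation order matters. The kernel `sum_orders` and the host helper `sum_orders_on_gpu` are made-up names for illustration, not anything from the original code. Atomics are left out so that the result is deterministic, but a floating-point atomicAdd() whose arrival order changes between builds or runs reorders a sum in the same way.

```cuda
#include <cuda_runtime.h>

// Add the same three floats in two different orders.
__global__ void sum_orders(float a, float b, float c, float *out)
{
    out[0] = (a + b) + c;
    out[1] = a + (b + c);
}

// Hypothetical host helper: launches the kernel and copies both sums back.
// Returns false if any CUDA call fails.
bool sum_orders_on_gpu(float a, float b, float c, float result[2])
{
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, 2 * sizeof(float)) != cudaSuccess) return false;

    sum_orders<<<1, 1>>>(a, b, c, d_out);

    cudaError_t err = cudaMemcpy(result, d_out, 2 * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

With a = 1e20f, b = -1e20f, c = 1.0f, the first sum is 1.0f and the second is 0.0f: 1.0f is far below the rounding granularity of 1e20f, so it is lost when it is added to b first.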

If the issues are not functional but numerical, I would suggest reading NVIDIA’s whitepaper, and the references it cites:


“It works” means numerically right.

My kernel is a 2D finite-difference code, with each thread calculating one grid node. What is new is that each thread runs a loop over several neighbor nodes and sums their contributions.

The result data differ between -G and -O. It seems that with -O, the value of a node (which has only one thread attached) is larger, which suggests that something else is making a redundant addition to that node.

I don’t know whether it is due to any ‘fused’ operations.

When compiling with -fmad=false, the result is right!

But, does it drag down the performance?

Disabling FMA contraction does normally have a negative effect on performance. How much depends on your code; in the worst case, performance could be cut in half.

As I mentioned, while the use of FMA often leads to different results compared to the use of separate FMUL and FADD, on average the use of FMA improves the accuracy of the results. The whitepaper I mentioned above provides some insights as to why this is the case. To assess the quality of numerical results in a meaningful way, I would suggest comparing them to a higher precision reference.
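As a small sketch of the effect described above, the following evaluates x*x - y with a separate multiply and add, with a fused multiply-add, and in double precision as the reference. The single-precision intrinsics pin down the operations so the outcome does not depend on -fmad. The function names and the input values are assumptions chosen for illustration.

```cuda
#include <cuda_runtime.h>

// Evaluate x*x - y three ways.
__global__ void fma_vs_separate(float x, float y, float *out_f, double *out_d)
{
    // Separate FMUL and FADD: the product is rounded before the subtraction.
    out_f[0] = __fadd_rn(__fmul_rn(x, x), -y);
    // FMA: the exact product is used, with a single rounding at the end.
    out_f[1] = __fmaf_rn(x, x, -y);
    // Double-precision reference (the float product is exact in double).
    out_d[0] = (double)x * (double)x - (double)y;
}

// Hypothetical host helper: runs the kernel and copies the results back.
// Returns false if any CUDA call fails.
bool fma_vs_separate_on_gpu(float x, float y, float out_f[2], double *out_d)
{
    float *d_f = nullptr;
    double *d_d = nullptr;
    bool ok = cudaMalloc(&d_f, 2 * sizeof(float)) == cudaSuccess
           && cudaMalloc(&d_d, sizeof(double)) == cudaSuccess;
    if (ok) {
        fma_vs_separate<<<1, 1>>>(x, y, d_f, d_d);
        ok = cudaMemcpy(out_f, d_f, 2 * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(out_d, d_d, sizeof(double),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_f);
    cudaFree(d_d);
    return ok;
}
```

For x = 1.000244140625f (1 + 2^-12) and y = 1.00048828125f (1 + 2^-11), both exactly representable in single precision, the exact answer is 2^-24. The separately rounded product equals y, so the first variant returns 0, while the FMA variant and the double-precision reference both return 2^-24.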

For what it’s worth, the latest x86 CPUs also support FMA, so you may observe similar differences with host code depending on which CPU you run on and what compiler switches were used to compile the code.

Thanks a lot, one more question:

Rather than using ‘-fmad=false’ in the compile step, can I disable FMA in the preprocessor, like




I am not sure what you are envisioning. #if is used for conditional compilation, while the compiler flag -fmad=false controls a code transformation applied to floating-point expressions.

You could control the code generation tightly by coding everything in intrinsics. So instead of the ‘*’ and ‘+’ operators you would use the device functions __fmul_rn(), __fadd_rn(), __fmaf_rn() in single-precision code and __dmul_rn(), __dadd_rn(), __fma_rn() in double-precision code. The CUDA math library uses this technique extensively to isolate itself as much as necessary from programmer-selected compiler settings (in particular a programmer specifying -fmad=false). The downside is less readable code.
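A rough sketch of what that looks like in a double-precision kernel. The kernel below is a hypothetical 1D stand-in for a neighbor-summing loop, not the original poster's code; `accumulate_no_fma`, `run_accumulate`, and the layout of the weights are invented for illustration. According to the CUDA math API documentation, __dmul_rn() and __dadd_rn() are never merged into an FMA, so only this kernel is affected and the rest of the file can keep the default -fmad=true.

```cuda
#include <cuda_runtime.h>

// Each thread sums w[k + radius] * u[i + k] over its neighbors k = -radius..radius.
// The multiply and the add stay separate operations regardless of -fmad.
__global__ void accumulate_no_fma(const double *u, const double *w,
                                  double *out, int n, int radius)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < radius || i >= n - radius) return;  // boundary nodes are left untouched

    double sum = 0.0;
    for (int k = -radius; k <= radius; ++k) {
        double term = __dmul_rn(w[k + radius], u[i + k]);  // rounded product
        sum = __dadd_rn(sum, term);                        // separate rounded add
    }
    out[i] = sum;
}

// Hypothetical host helper: copies inputs to the GPU, runs the kernel, and
// copies the result back. Boundary entries of out come back as 0.
bool run_accumulate(const double *u, const double *w, double *out,
                    int n, int radius)
{
    double *d_u = nullptr, *d_w = nullptr, *d_out = nullptr;
    size_t n_bytes = n * sizeof(double);
    size_t w_bytes = (2 * radius + 1) * sizeof(double);

    bool ok = cudaMalloc(&d_u, n_bytes) == cudaSuccess
           && cudaMalloc(&d_w, w_bytes) == cudaSuccess
           && cudaMalloc(&d_out, n_bytes) == cudaSuccess
           && cudaMemcpy(d_u, u, n_bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(d_w, w, w_bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemset(d_out, 0, n_bytes) == cudaSuccess;
    if (ok) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        accumulate_no_fma<<<blocks, threads>>>(d_u, d_w, d_out, n, radius);
        ok = cudaMemcpy(out, d_out, n_bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_u);
    cudaFree(d_w);
    cudaFree(d_out);
    return ok;
}
```

Only the expression that misbehaves under contraction needs the intrinsics; the other kernels can keep the ordinary operators and still benefit from FMA.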

It is not clear why you would want to go to that length. In general the use of FMA is a good thing, which is why more and more architectures incorporate this operation; it has been included in the IEEE-754 floating-point standard since 2008.

I know my question was not very clear.

Conventionally, I put all the global kernels together in one .cu file and compile it. If it is compiled with -fmad=false,
the expected results are obtained, but performance drops.

Since only one of the kernels has the ‘fmad’ trouble,
I isolated it from the others, moved it into another .cu file, and compiled only that file with -fmad=false.
After that, I link everything together into the main program.

But done this way, the result is not right.

I don’t know why it makes a difference when the same global kernel is moved to another file.

If I want to compare a double-precision number with zero, like x>0.0 or y==0.0, is there anything I should be cautious about? Since my code contains such comparisons, I wonder whether that is OK in CUDA.

Lastly, I would like to summarize.

By using __dmul_rn() instead of ‘*’, the result is right. It is not necessary to compile with -fmad=false.

Double-precision comparisons like x>0.0 or y==0.0 are OK for my kernel.

Thank you all.