Code compiled with -g -G gives different results from code compiled with -O

rolyluoli · March 6, 2014, 3:51am

Hi, I have run into a strange problem: when I compile one CUDA kernel, the code built with -g -G produces different results from the code built with -O.

This works:

nvcc -gencode arch=compute_20,code=sm_20 -g -G -o gpu_kernels.o -c --compiler-options="-Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -g -I/… -D__INSDIR__= " gpu_kernels.cu

But this does not work:

nvcc -gencode arch=compute_20,code=sm_20 -o gpu_kernels.o -c --compiler-options="-Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O -I/… -D__INSDIR__= " gpu_kernels.cu

Robert_Crovella · March 6, 2014, 4:29am

-G disables most (almost all) device code optimizations that the (device code) compiler might make. One of the principal reasons for this is so that sane mappings can be made between source code and device machine code, to facilitate source-level debug. (-G also generates additional symbols for debug, like -g does for host code).

I guess by “work/not work” you mean that the code with -G produces the expected result, and the code without it does not.

This happens from time to time, and it usually means one of two things:

(1) there is a logic error or other defect in your code (such as a race condition) that is being exposed when the compiler optimizes code, or

(2) a compiler bug

It may not be possible to tell which without a short, concise, compilable code sample that demonstrates the issue, along with the OS, CUDA version, driver version, and compile command line you are using, as well as the specific GPU you are running on.

njuffa · March 6, 2014, 6:29am

In case “it doesn’t work” refers to numerical differences of some kind, further possibilities are:

(3) use of floating-point atomics leads to different operation order [floating-point arithmetic is, in general, not associative]

(4) contraction of FMUL followed by FADD to FMA (fused multiply-add) due to code optimization leads to a reduction in accumulated rounding error and possibly reduces the effects of subtractive cancelation. You can turn off these contractions with -fmad=false
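Points (3) and (4) can be reproduced on the host. The following is only a minimal C++ sketch, not taken from the kernel in question; the function names are made up for illustration. It assumes IEEE-754 float and double arithmetic with the default round-to-nearest mode. The volatile temporary is there so that a host compiler cannot itself contract the product and the sum into an FMA.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// (3) Same values, summed front to back.
float sum_forward(const float* v, std::size_t n) {
    assert(v != nullptr);
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += v[i];
    return s;
}

// (3) Same values, summed back to front. With atomics the order in which
// threads add their contributions is not fixed, so any order may occur.
float sum_backward(const float* v, std::size_t n) {
    assert(v != nullptr);
    float s = 0.0f;
    for (std::size_t i = n; i-- > 0;) s += v[i];
    return s;
}

// (4) Multiply, round, then add, round again: like FMUL followed by FADD.
double mul_then_add(double a, double b, double c) {
    volatile double p = a * b;  // product is rounded to double here
    return p + c;
}

// (4) a*b + c with a single rounding at the end: like FMA.
double fused_mul_add(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

With v = {1.0e8f, -1.0e8f, 1.0f}, the forward sum is 1 and the backward sum is 0, because 1 is lost when it is added to -1.0e8f. With a = 1 + 2^-27, b = 1 - 2^-27 and c = -1, the separate version returns 0 (the product rounds to 1), while the fused version returns the exact value -2^-54.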

If the issues are not functional but numerical, I would suggest reading NVIDIA’s whitepaper, and the references it cites: http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf

rolyluoli · March 6, 2014, 7:16am

Thanks.

By “it works” I mean that the results are numerically right.

My kernel is a 2D finite-difference code, with each thread calculating one grid node. What is new is that each thread runs a loop over several neighbor nodes and sums their contributions.

The result data differ between -G and -O. With -O the value at a node (which only one thread writes to) comes out larger, as if something else were making redundant additions to that node.

I don’t know whether it is due to any ‘fused’ operations.

rolyluoli · March 6, 2014, 7:35am

When I compile with -fmad=false, the result is right!

But does it hurt performance?

njuffa · March 6, 2014, 8:13am

Disabling FMA contraction does normally have a negative effect on performance. How much depends on your code; in the worst case performance could be cut in half.

As I mentioned, while the use of FMA often leads to different results compared to the use of separate FMUL and FADD, on average the use of FMA improves the accuracy of the results. The whitepaper I mentioned above provides some insights as to why this is the case. To assess the quality of numerical results in a meaningful way, I would suggest comparing them to a higher precision reference.
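As a rough illustration of comparing against a higher precision reference, here is a minimal host-side C++ sketch. It is an assumption-laden toy rather than a test of any real kernel: the function names are invented, the computation is a single a*b + c in single precision, and the reference is the same expression evaluated in double (where the product of two floats is exact).

```cpp
#include <cmath>

// Reference value: a*b + c evaluated in double precision.
double reference(float a, float b, float c) {
    return static_cast<double>(a) * static_cast<double>(b)
         + static_cast<double>(c);
}

// Single precision, product rounded before the add (FMUL then FADD).
// The volatile store keeps a host compiler from fusing the two operations.
float separate_f32(float a, float b, float c) {
    volatile float p = a * b;
    return p + c;
}

// Single precision, one rounding at the end (FMA).
float fused_f32(float a, float b, float c) {
    return std::fma(a, b, c);  // float overload of std::fma
}

// Absolute error of a single-precision result against the reference.
double abs_error(float result, double ref) {
    return std::fabs(static_cast<double>(result) - ref);
}
```

For a = 1 + 2^-13, b = 1 - 2^-13, c = -1 the reference is -2^-26. The fused version reproduces it exactly, while the separate version returns 0 and is off by 2^-26. Neither result is "the -G answer" or "the -O answer" by definition; the reference is what tells you which one is closer.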

For what it’s worth, the latest x86 CPUs also support FMA, so you may observe similar differences with host code depending on which CPU you run on and what compiler switches were used to compile the code.

rolyluoli · March 6, 2014, 9:17am

Thanks a lot, one more question:

Rather than passing ‘-fmad=false’ at compile time, can I disable FMA with the preprocessor, something like

#if …
…
#endif

?

njuffa · March 6, 2014, 6:24pm

I am not sure what you are envisioning. #if is used for conditional compilation, while the compiler flag -fmad=false controls a code transformation applied to floating-point expressions.

You could control the code generation tightly by coding everything in intrinsics. So instead of the ‘*’ and ‘+’ operators you would use the device functions __fmul_rn(), __fadd_rn(), __fmaf_rn() in single-precision code and __dmul_rn(), __dadd_rn(), __fma_rn() in double-precision code. The CUDA math library uses this technique extensively to insulate itself, as far as necessary, from programmer-selected compiler settings (in particular a programmer specifying -fmad=false). It makes for less readable code, though.
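One way to keep the intrinsics out of sight is to wrap them in small helpers and use those only in the expressions that must not be contracted. The sketch below is a hedged illustration, not the original poster's code: HD, mul_rn, add_rn and accumulate are hypothetical names. It is written so that the same source builds as plain C++ (using a volatile temporary as a host stand-in) and, under nvcc, uses __dmul_rn() and __dadd_rn() in device code.

```cpp
// HD, mul_rn, add_rn and accumulate are illustrative names only.
#ifdef __CUDACC__
#define HD __host__ __device__   // compiled by nvcc
#else
#define HD                       // compiled by a plain C++ compiler
#endif

// Multiplication with its own rounding step, never merged into an FMA.
HD inline double mul_rn(double a, double b) {
#ifdef __CUDA_ARCH__
    return __dmul_rn(a, b);      // CUDA device intrinsic, round to nearest even
#else
    volatile double p = a * b;   // host stand-in: force the product to be rounded
    return p;
#endif
}

// Addition with its own rounding step, never merged into an FMA.
HD inline double add_rn(double a, double b) {
#ifdef __CUDA_ARCH__
    return __dadd_rn(a, b);      // CUDA device intrinsic, round to nearest even
#else
    volatile double s = a + b;   // host stand-in
    return s;
#endif
}

// One neighbor contribution added to a running sum: acc + w * u,
// with the multiply and the add kept as two separate operations.
HD inline double accumulate(double acc, double w, double u) {
    return add_rn(acc, mul_rn(w, u));
}
```

Inside a kernel loop over neighbor nodes one would then write something like sum = accumulate(sum, weight, value); in place of sum += weight * value;, leaving all other code free to use FMA.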

It is not clear why you would want to go to that length. In general the use of FMA is a good thing, which is why more and more architectures incorporate this operation; it has been included in the IEEE-754 floating-point standard since 2008.

rolyluoli · March 7, 2014, 2:19am

I know my question was not very clear.

Normally I put all the global kernels together in one .cu file and compile it. If I compile it with -fmad=false, I get the expected results, but performance suffers.

Since only one of the kernels has the ‘fmad’ problem, I moved it out into a separate .cu file and compiled only that file with -fmad=false. After that I link everything together into the main program.

But done this way, the result is not right.

I don’t know why moving the same global kernel to another file makes a difference.

rolyluoli · March 7, 2014, 9:33am

If I want to compare a double-precision number with zero, as in x>0.0 or y==0.0, is there anything I should be cautious about? My code contains such comparisons, and I wonder whether they are OK in CUDA.
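The comparison itself behaves the same everywhere, but the value being compared can depend on whether FMA contraction happened. As a hedged host-side C++ sketch (the function names are invented, and the "contracted" variant only mimics one way a compiler might fuse the expression), consider x*x - y*y tested against zero when x equals y:

```cpp
#include <cmath>

// x*x - y*y with both products rounded before the subtraction.
double diff_of_squares_separate(double x, double y) {
    volatile double xx = x * x;  // rounded product
    volatile double yy = y * y;  // rounded product
    return xx - yy;
}

// The same expression as it may look after contraction:
// y*y is rounded, but x*x stays exact inside the FMA.
double diff_of_squares_contracted(double x, double y) {
    volatile double yy = y * y;  // rounded product
    return std::fma(x, x, -yy);  // exact x*x minus rounded y*y, rounded once
}
```

For x = y = 1 + 2^-30 the separate version returns exactly 0, so a test such as result == 0.0 is true, while the contracted version returns 2^-60 (the rounding error of y*y), so the same test is false and result > 0.0 becomes true. Comparisons against zero are therefore fine as long as the logic does not rely on a computed difference being exactly zero.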

rolyluoli · March 8, 2014, 3:08am

Finally, I would like to sum up.

Using __dmul_rn() instead of ‘*’ gives the right result, so it is not necessary to compile with -fmad=false.

Double-precision comparisons like x>0.0 or y==0.0 are OK for my kernel.

Thank you all.
