Different results when compiling in Debug and Release mode

I’ve never seen this before.
I’m getting slightly different results when compiling in debug mode.
The results are exactly the same from run to run so I don’t think it’s a race condition.
What else could cause accuracy differences between Debug and Release builds?

If your code uses floating-point computation, the most likely reason is that an FADD dependent on an FMUL will frequently be contracted to an FMA (fused multiply-add) in a release build. To confirm that this explains the differences, you can turn off the contraction by building with -fma=false. Since disabling it typically has a negative impact on both accuracy and performance, you wouldn’t want to use that for your production build, but it is useful for experiments.

I take your point about there being no race condition, but I would also suggest running the code under cuda-memcheck, including separate runs with each of the sub-tool options (initcheck, synccheck, racecheck, etc.)

http://docs.nvidia.com/cuda/cuda-memcheck/index.html#abstract
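For completeness, the sub-tools are selected with the `--tool` switch; the application name below is a placeholder:

```shell
# default memcheck pass, then each specialized sub-tool (app name is hypothetical)
cuda-memcheck ./myapp
cuda-memcheck --tool racecheck ./myapp   # shared-memory data races
cuda-memcheck --tool initcheck ./myapp   # reads of uninitialized device memory
cuda-memcheck --tool synccheck ./myapp   # invalid __syncthreads() usage
```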

If I put -fma=false in the additional options, I get: nvcc fatal : Unknown option ‘fma’

If I put it in Additional Compiler Options, I get:
1>cl : Command line warning D9002: ignoring unknown option ‘-fma=false’

Where should I put this in the VS project properties?

Sorry, I mistyped: -fmad=false. Hint: nvcc --help will list the command line arguments it accepts.

Is it by any chance supposed to be --fmad=false ?
ETA: just saw your correction

SOLVED: The difference was due to fused multiply add.

Disabled it in the Release build and the results are identical to Debug.
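For anyone landing here later, the correctly spelled flag goes on the nvcc command line (file names below are placeholders); it belongs with the nvcc options, not the host compiler options, which is why cl warned about it earlier:

```shell
# disable contraction of FMUL + FADD into FMA (useful for diagnosis only)
nvcc -O3 --fmad=false -o myapp myapp.cu
```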

Keep in mind that the use of fused multiply-add (FMA) usually has a significant positive impact on performance, and a noticeable positive impact on accuracy, so you would ultimately want to allow the compiler to use the contraction for release builds.

Since FMA contraction is an optimization, and debug builds are unoptimized, these kinds of numerical discrepancies are common (and occur in like fashion on CPUs that support FMA). One way around this that preserves full performance is to code the FMAs directly, using the standard C/C++ math library functions fma() and fmaf(). Depending on the nature of your code, that may be trivial to do, or a pain in the neck.

Personally, I have gotten into the habit of using fma() directly in the source code: Often there are multiple ways to re-arrange a computation for the use of FMA, only one of which is “optimal” in terms of accuracy. The compiler does not understand numerical analysis, it just walks the DAG constructed from the source expression and contracts FADD dependent on FMUL.

@njuffa, they’re trying to mechanize your knowledge! :)

http://herbie.uwplse.org/

Interesting, hadn’t heard of that. I am going to take a look. I am not a numerical analysis guy either (my degree is in CS, not math), but I have a reasonable understanding of many numerical issues. Just the other week I had to investigate, in brute-force fashion, the most accurate sequence to use for the most significant terms of a polynomial. While this seemed like a simple situation, the best variant I found (out of the five or so arrangements I tried) was not at all what I would have expected.

So any tool that can reason intelligently about numerics, in particular in the presence of FMA, has the potential to save quite some time. Overall there seems to be too little written in the literature about all the improvements that use of FMA enables.