What is the source of incorrect floating-point math?

I keep running into cases where my CUDA application deterministically gets incorrect answers for floating-point math. Presently, I work around the problem by tweaking the code a bit.

For example, the following code fails for me on certain test cases on a V100 card, but works on other GPU hardware with the same test case.

    float const ratio      = bias / ( bias + distance ) / orig;
    float const base       = right + gap;
    float const size       = base + gap;
    float const remainder  = extended - size;
#if 1
    float const sum        = remainder + gap;
    float const reciprocal = 1 / remainder;
    float const answer     = ratio * density * sum * reciprocal;
#else
    float const answer     = ratio * density * ( remainder + gap ) / remainder;
    // answer: 5000.5000000000 ratio: 1.0000000000 density: 0.1538460851
    // remainder: 14.0000000000 gap: 1.9870082140
#endif

As shown, the #if 1 branch works around the bug, while the #else branch shows the original code, with a comment giving the incorrect answer it computes.

I’m using CUDA compilation tools, release 9.2, V9.2.88, and the driver version on the machine that fails is 440.44.

I’ve seen other examples of this on other hardware, using different versions of the driver but the same version of the compiler, so my guess is that this is a compiler bug.

Any ideas about what the likely culprit is, or how to determine it efficiently? The compiler? The driver? This specific V100 card? Something else?

It is impossible to diagnose this from a code snippet. Consider posting a minimal self-contained program that others can build and run to reproduce the issue.

Exactly how incorrect is “incorrect”? Mathematically equivalent variants of a particular computation are not necessarily equivalent when evaluated with floating-point arithmetic; bit-wise identical answers are not to be expected in such a case. Compiler bugs are possible but tend to be rare in mature toolchains like the CUDA compiler.
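
To give a sense of the scale of difference that is normal between such variants, here is a minimal host-side sketch (plain C++; the constants are simply lifted from the comment in your snippet) that evaluates both formulations. The reciprocal path adds one extra rounding step, so the two results may differ by a unit or two in the last place, but they should not differ by orders of magnitude.

#include <stdio.h>

int main(){
    // constants taken from the comment in the posted snippet
    const float ratio      = 1.0f;
    const float density    = 0.1538460851f;
    const float remainder  = 14.0f;
    const float gap        = 1.9870082140f;

    // variant 1: multiply by a precomputed reciprocal
    const float sum        = remainder + gap;
    const float reciprocal = 1.0f / remainder;
    const float answer1    = ratio * density * sum * reciprocal;

    // variant 2: divide directly
    const float answer2    = ratio * density * (remainder + gap) / remainder;

    printf("answer1: %.10f\nanswer2: %.10f\n", answer1, answer2);
    return 0;
}

If you are seeing a discrepancy far larger than a few units in the last place, something other than floating-point rounding is going on.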

Wildly incorrect, as given in the comment shown. The 5000.5 is a random number unrelated to the correct answer.

No root cause analysis is possible without a minimal, complete repro case. If you suspect a compiler bug and intend to file a bug report with NVIDIA, you will need one for that as well.

Creating a stand-alone test case would be tricky. For now, I have a workaround: using -O3 instead of -O makes the issue go away.

My suggestion would be to provide the following:

  1. A complete code, preferably as short as possible. That should be something that I can copy, paste, compile, and run, and see the issue, without having to add anything or change anything.
  2. A platform description, which would include: GPU type, GPU driver version, CUDA version, Operating System, and the compile command line used to compile the item provided in 1 above.

If you want to provide that, I will take a(nother) look at it. If you don’t wish to for whatever reason, I’m less likely to look at it. It’s entirely your call as to what to do. These are just suggestions I am offering. Do as you wish.

FWIW I did try to assemble a test case out of what you have shown, and using CUDA 9.2 on a Tesla V100, I was unable to produce anything other than 0.175681 for answer. Yes, my GPU driver is different, but I think that is unlikely to be a contributor here, so I’m not willing to modify my driver install to test it.

$ cat t1825.cu
#include <stdio.h>

__global__ void k(const float ratio, const float density, const float remainder, const float gap){

//    float const ratio      = bias / ( bias + distance ) / orig;
//    float const base       = right + gap;
//    float const size       = base + gap;
//    float const remainder  = extended - size;
#ifndef USE_BUG
    float const sum        = remainder + gap;
    float const reciprocal = 1 / remainder;
    float const answer     = ratio * density * sum * reciprocal;
#else
    float const answer     = ratio * density * ( remainder + gap ) / remainder;
    // answer: 5000.5000000000 ratio: 1.0000000000 density: 0.1538460851
    // remainder: 14.0000000000 gap: 1.9870082140
#endif
    printf("answer: %f\n", answer);
}

int main(){


    //answer: 5000.5000000000
    const float ratio = 1.0000000000;
    const float  density = 0.1538460851;
    const float remainder =  14.0000000000;
    const float gap =  1.9870082140;
    k<<<1,1>>>(ratio, density, remainder, gap);
    cudaDeviceSynchronize();
}
$ /usr/local/cuda-9.2/bin/nvcc -arch=sm_70 -o t1825 t1825.cu
$ ./t1825
answer: 0.175681
$ /usr/local/cuda-9.2/bin/nvcc -arch=sm_70 -o t1825 t1825.cu -DUSE_BUG
$ ./t1825
answer: 0.175681
$

(driver 440.64.00, CentOS 7.x, Tesla V100-PCIE 32GB)

It’s entirely possible that there is a compiler bug of some sort. I’m not able to see it. As a general statement, bugs get fixed all the time; I often recommend that people use the latest toolchain(s) for this reason. You can also just file a bug on our public portal; instructions are linked in a sticky post at the top of this forum. It’s likely that if you do so, the bug handlers will ask you for the same things I asked you for at the top of my response here.

For what it is worth, not all invocations of my application exhibit the bug; it is specific to particular patterns, deterministically occurring in about 1% of all variations. Perhaps the grid layout somehow comes into play?

Creating a stand-alone test case would be tricky.

Why is that?

Compiler bugs would typically manifest at higher optimization levels. The default optimization level for nvcc is -O3, last time I checked. A compiler bug seems less likely, and a bug in your code seems more likely.

Without a root cause analysis, this problem will come back to bite you in the behind, and according to Murphy’s Law that will happen at the most inopportune time.

You might want to try a more recent CUDA version. That is typically the first thing the NVIDIA bug intake team would suggest when a bug is filed against an older CUDA version.

Switching to a newer version of CUDA is high on my list; however, given the large number of our customers in the field with older GPU hardware, it is non-trivial for us to migrate to a later version.

Do I correctly understand that you are contending that the -O3 workaround is most likely not avoiding the bug but merely pushing it around, only for it to pop up elsewhere?

Note that cuda-memcheck finds no issues with our code.

I spoke of likelihood as to root cause, not certainty. cuda-memcheck can find certain classes of bugs; it cannot find all bugs.
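
As a purely hypothetical sketch (not a claim about your actual code): a logic error such as an in-bounds but wrong index produces a deterministically wrong result, yet cuda-memcheck has nothing to report, because no out-of-bounds or misaligned access ever occurs.

#include <stdio.h>

// Hypothetical illustration: the kernel reads a neighboring element because
// of a wrong index that stays inside the allocation. Every output value is
// wrong, but every access is in bounds, so the memcheck tool stays silent.
__global__ void scale(const float *in, float *out, int n, float factor){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i + 1) % n] * factor;   // bug: should be in[i]
}

int main(){
    const int n = 4;
    float h_in[n]  = {1.0f, 2.0f, 3.0f, 4.0f};
    float h_out[n] = {0.0f, 0.0f, 0.0f, 0.0f};
    float *d_in, *d_out;
    cudaMalloc((void**)&d_in,  n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<1, n>>>(d_in, d_out, n, 0.5f);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) printf("out[%d] = %f\n", i, h_out[i]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

cuda-memcheck should report no errors on code like this, even though every output value differs from what was intended.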

Without a root cause established, anything that seems to make an issue go away is not a proper workaround but the equivalent of voodoo magic. In my not so humble opinion.