I keep running into cases where my CUDA application deterministically gets incorrect answers for floating-point math. At present, I work around it by tweaking the code a bit.
For example, the following code fails for me on certain test cases on a V100 card, but works on other GPU hardware with the same test cases.
    float const ratio = bias / ( bias + distance ) / orig;
    float const base = right + gap;
    float const size = base + gap;
    float const remainder = extended - size;
    #if 1
    float const sum = remainder + gap;
    float const reciprocal = 1 / remainder;
    float const answer = ratio * density * sum * reciprocal;
    #else
    float const answer = ratio * density * ( remainder + gap ) / remainder;
    // answer: 5000.5000000000 ratio: 1.0000000000 density: 0.1538460851
    // remainder: 14.0000000000 gap: 1.9870082140
    #endif
As shown, the #if 1 branch works around the bug, while the #else branch shows the original code, with a comment giving the incorrect answer it computed.
The compiler is Cuda compilation tools, release 9.2, V9.2.88 and the driver version on the machine that fails is
I’ve seen other examples of this on other hardware, using different versions of the driver, but the same version of the compiler, so my guess is a bug in the compiler.
Any ideas about what the likely culprit is or how to efficiently determine it? The compiler? The driver? The specific hardware instance of V100? Something else?