Floating point operations IEEE compliance and debug mode

Hi all, I am currently using CUDA on a compute capability 3.0 accelerator (GTX 680) for computations on 32-bit floats, and I discovered some discrepancies when comparing GPU and CPU computation results.

The weird thing happens when I play with the nvcc compile options: when using the
“-gencode arch=compute_30,code=sm_30” option I always obtain a rounded result.
But when I use the nvcc -G option (device debug, which I normally don’t use because it returns errors when graphics processing runs at the same time), the floating-point computation returns the same value as the CPU.

A very simple example of a kernel that shows the error:

__global__ void calc(float* fpDeviceBuffer, float a, float b, unsigned short c, float d)
{
	// same expression as on the CPU side
	*fpDeviceBuffer = a*(d + (c*b));
}

with Host code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <iostream>

#include <helper_cuda.h>

__global__ void calc(float* fpDeviceBuffer, float a, float b, unsigned short c, float d);

float a = 6708.0f;
float b = (4.0f - 0.25f)/( powf(2.0f, 16.0f) - 1.0f );
unsigned short c = 17278;
float d = 0.250000f;
float *fpHostBuffer = NULL;
float *fpDeviceBuffer = NULL;

int main(int argc, char* argv[])
{
	float result = a*(d + (c*b));
	printf("CPU result %a\n", result);
	printf("CPU result %f\n", result);

	//page-locked memory
	checkCudaErrors( cudaHostAlloc( (void**)&fpHostBuffer, sizeof(float), cudaHostAllocDefault ) );

	//Device memory
	checkCudaErrors( cudaMalloc( (void**)&fpDeviceBuffer, sizeof(float) ) );

	calc<<<1, 1>>>(fpDeviceBuffer, a, b, c, d);
	checkCudaErrors( cudaMemcpy( fpHostBuffer, fpDeviceBuffer, sizeof(float), cudaMemcpyDeviceToHost ) );

	printf("GPU result %a\n", *fpHostBuffer);
	printf("GPU result %f\n", *fpHostBuffer);

	checkCudaErrors( cudaFreeHost(fpHostBuffer) );
	checkCudaErrors( cudaFree(fpDeviceBuffer) );
	return 0;
}



The result is

CPU result 0x1.03a7fep+13
CPU result 8308.999023
GPU result 0x1.03a8p+13
GPU result 8309.000000

when built without debug, and

CPU result 0x1.03a7fep+13
CPU result 8308.999023
GPU result 0x1.03a7fep+13
GPU result 8308.999023

when the -G option is used.

I haven’t found much information related to this issue in the NVIDIA CUDA compiler documentation. Any ideas?

The exact result is 8308.99954222934309910734721…

CPU deviation is -0.00051922…
GPU deviation is 0.00045777…

The GPU result (with debugging switched off) is closer to the exact result than the CPU result. Shocking, isn’t it?

Most likely a difference caused by the compiler generating FMA (fused multiply-add) instructions with the default build but not the debug build. Contraction of an FMUL plus FADD sequence into FMA is an optimization, and for debug builds all optimizations are disabled. You can check this hypothesis by disabling the FMA merging with -fmad=false for all builds. I would also suggest reading the following whitepaper plus the references it cites: https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-CUDA-Floating-Point.pdf


The use of FMA often leads to results with improved accuracy, as tera points out. Use of FMA is also important for performance reasons, so you probably wouldn’t want to use -fmad=false for your production builds.

Thank you very much for this answer; the paper https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-CUDA-Floating-Point.pdf was particularly useful.

I do understand why I get those results now. It was interesting to see that building with -fmad=false changed the precision the same way the -G option did. (Just don’t forget to clean the project before rebuilding with Nsight.)