Cube computing difference in GPU and CPU?

_arch=sm_20;

I write a simple cuda program for a test about cube computing.The result is strange and i don’t know why.
Here is my whole code:

#include<stdio.h>
#include<stdlib.h>
#include <string.h>
#include <ctype.h>
#include <time.h>
#include<math.h>

 void __global__ test(int N,double *dev_a,double *dev_b)
{
 int i = blockIdx.x * blockDim.x + threadIdx.x;
	if(i>=0 && i<N)
	{
		dev_b[i] = dev_a[i] + dev_a[i]*dev_a[i]*dev_a[i];
	}
}

int main()
{

double *a=NULL,*b=NULL,*b1=NULL;
double *dev_a,*dev_b;
double xrand1,xrand2,tx1;
int N=124;
int i=0,j=0;
int nthreads;

a=(double*)malloc(N*sizeof(double));
b=(double*)malloc(N*sizeof(double));
b1=(double*)malloc(N*sizeof(double));

cudaMalloc((void**)&dev_a, N*sizeof(double));	
cudaMalloc((void**)&dev_b, N*sizeof(double));   

srand(time(0)); 
for(j=1;j<10000;j++)
{

for(i=0; i<N; i++)
{   
        xrand1 = (1.0e0*rand())/RAND_MAX;
        tx1 = xrand1 ;
        a[i]=tx1;
        b[i]=0.0e0;
        b1[i]=0.0e0;
}

cudaMemcpy(dev_a , a, N*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b , b, N*sizeof(double), cudaMemcpyHostToDevice);

  nthreads = 64;
  test<<<(N+nthreads-1)/nthreads, nthreads>>>(N,dev_a,dev_b);


for(i=0;i<N;i++)
{
	b[i] = a[i] + a[i]*a[i]*a[i];

}

cudaMemcpy(b1, dev_b, N*sizeof(double), cudaMemcpyDeviceToHost);

for(i=0; i<N; i++)
{
	if(fabs(b[i]-b1[i])>= 1.0e-16)
	{
		printf("%d,%e\n",i,b[i]-b1[i]);
	}
}
	
	printf("%d\n",j);
	getchar();

}

free(a);
free(b);
free(b1);
cudaFree(dev_a);
cudaFree(dev_b);


return 0;

}

The error always exists.But when i change

dev_b[i] = dev_a[i] + dev_a[i]*dev_a[i]*dev_a[i];

into

dev_b[i] = dev_a[i];
  dev_b[i] += dev_a[i]*dev_a[i]*dev_a[i];

the error will disapper.

Hope someone explain this to me.

I assume the “error” you are referring to is a discrepancy between host and device calculations, and that when you make the code change, you get no discrepancy printout.

I’m unable to reproduce that observation (I get discrepancy printout in either case), but I’m not testing on a cc2.0 device and I am on linux, whereas it appears you are probably on windows.

My guess would be a difference in FMA contraction of operations between the two different device computation cases. Probably @njuffa would be able to give a better diagnosis.

Thanks for your reply!
I’m so sorry that i make a mistake about the change in my code.And it supposed to be

dev_b[i] = dev_a[i]*dev_a[i]*dev_a[i];
dev_b[i] += dev_a[i];

I am on linux too.When i add the option “-famd=false”,the error disappers.
But i do not know there are some others rules that may affect my result.I appreciate you can read my last topic,that problem is an example.

I think you meant -fmad not -famd

Yes, with that code change I can reproduce your observation.

The -fmad=false observation pretty much confirms the discrepancy in this case is due to FMA contraction of operations.

Usually, FMA contraction results in the same or better accuracy, not reduced accuracy, so my guess is that the CPU calculation is actually less accurate, and when you turn off FMA contraction/usage, the GPU result also becomes identically less accurate.

While the CUDA toolchain treats floating-point computation conservatively in general (no frequent re-associations that occur by default in various host compilers), it does have FMA contraction turned on by default because the FMA (fused multiply-add) is the central building block of the computational units of the GPU.

As already discussed, FMA contraction can be turned off by adding -fmad=false to the nvcc command line, but this will in general have a negative impact on performance and accuracy. This NVIDIA blog post has a link to an NVIDIA white paper on floating-point topics, including the use of FMA:

https://devblogs.nvidia.com/parallelforall/everything-you-ever-wanted-know-about-floating-point-were-afraid-ask/

Note that modern CPUs also include an FMA instruction, so if your host compiler supports FMA contraction, you might want to try turning that on. Generally speaking, one should not expect bit-wise matching floating-point results between any two platforms: results can differ for any number of reasons, including different compiler or library versions.